CN-116701979-B - Social network data analysis method and system based on limited k-means

CN116701979BCN 116701979 BCN116701979 BCN 116701979BCN-116701979-B

Abstract

The invention provides a social network data analysis method and a system based on a limited k-means, which relate to the technical field of social network data processing, wherein the method mainly considers the constraint of a must-connection in an initialization center selection stage, and after a first center is randomly selected, the rest clustering centers are selected through cyclic calculation of weight probability influenced by the constraint of the must-connection. And then, in the allocation step of the algorithm iteration stage, aiming at two data constraint types, adopting a strategy of preferentially processing disjoint non-connected sets, preferentially considering intersection of the disjoint non-connected sets and the non-connected sets, and classifying constraint points to achieve higher algorithm efficiency. The present disclosure solves the problem of inaccuracy in the clustering process for data processing subject to the must-be-connected constraint and the disjoint no-connect constraint.

Inventors

GUO LONGKUN
XUE RUIXIN
Jia Chaoqi

Assignees

齐鲁工业大学(山东省科学院)

Dates

Publication Date: 20260508
Application Date: 20230529

Claims (8)

1. The social network data analysis method based on the limited k-means is characterized by comprising the following steps of: When clustering is carried out, randomly selecting one data point as an initial clustering center, and then selecting other initial clustering centers by considering the condition that other data points are constrained by ML/DCL to form an initial clustering center set; the network data set is a discrete data set, and the ML set is a point set of a group of data Each of which is Are all a set of ML, given data points If (if) Then DCL sets as a set of points of a set of data Each of which is Is one satisfying Given a set of DCL data points If (if) And is also provided with m Then it is necessary to Wherein A m is one of k clusters after the data set P is clustered; The data points are distributed to the clusters where each initial cluster center in the initial cluster center set is located by adopting a minimum sum matching method, wherein the DCL set is preferentially processed and the condition that data in the DCL set is intersected with the ML set is considered, if the data points in the DCL set also belong to the ML set, the data points are determined to be calculated according to the weights, otherwise, the data points are directly used for calculation until the DCL set is distributed; The process of assigning data points to clusters of each initial cluster center in the initial cluster center set by adopting a minimum sum matching method comprises the steps of sequentially assigning the data points to the clusters of each center in the initial cluster center set, preferentially processing a DCL set, processing the first data point in the DCL set, and if the data points in the DCL set belong to an ML set, using the ML set as the weight of the ML set Centroid of I If the data points in the DCL set do not belong to the ML set, directly calculating the data points, calculating the initial cluster centers corresponding to the processed data points by a minimum sum matching method, enabling the total square distance sum to be minimum, and respectively distributing the data points to clusters where the corresponding initial cluster centers are located; And processing the rest data points until the processing is complete and an initial cluster set is obtained, updating a cluster center by adopting a mean value for each cluster, acquiring a new cluster center set, iteratively updating the cluster set by utilizing the new cluster center set, stopping iteration until smaller cost cannot be acquired, and classifying the data in the social network into a certain number of clusters.
2. The method for analyzing social network data based on the limited k-means as recited in claim 1, wherein one data point is randomly selected as an initial cluster center, other initial cluster centers are selected by considering the condition that other data points are constrained by ML/DCL, the process of forming the initial cluster center set is to circularly perform weighted calculation of the data points according to the condition that the data points are constrained by ML/DCL, select the next data point according to the weighted probability of the data points, if the data points are constrained by ML, take the centroid of the ML set to represent the ML set and add the ML set into the initial cluster center set, if the data points are constrained by DCL, the data points are directly used as a cluster center until the number of the initial cluster centers is selected to form the initial cluster center set.
3. The method for analyzing social network data based on a limited k-means according to claim 1, wherein the processing of remaining data points until the processing is complete and an initial cluster set is obtained is as follows: For the data points of the remaining ML set, if the data points belong to the ML set, the weight is | Centroid of I If not belonging to the ML set, directly distributing the data points to the nearest initial clustering center until all the data points are distributed to obtain initial clustering.
4. The method for analyzing social network data based on the limited k-means according to claim 1, wherein for each cluster, updating the cluster center by using the mean value, obtaining a new cluster center set, and iteratively updating the cluster set by using the new cluster center set, until a smaller cost cannot be obtained, stopping the iteration, wherein the steps are as follows: and for each initial cluster, updating the cluster center through a mean value method to obtain an updated cluster center set, iterating the updated cluster center set by using the iteratively updated cluster center set, judging whether the termination condition of the iterative update is reached or not by using the cost difference, if so, starting the next iteration until the smaller cost is not obtained any more, and ending the iteration.
5. The method of claim 1, wherein the ML constraint is a constraint that multiple data must belong to a set, and the DCL constraint is a constraint that multiple data must not belong to a set.
6. A social networking data analysis system based on a limited k-means, comprising: The data acquisition module is used for acquiring data in a social network, forming the data of the social network into a network data set, and giving the clustering number of data aggregation, an ML set and a DCL set, wherein the network data set is a discrete data set, and the ML set is a point set of a group of data Each of which is Are all a set of ML, given data points If (if) Then DCL sets as a set of points of a set of data Each of which is Is one satisfying Given a set of DCL data points If (if) And is also provided with m Then it is necessary to Wherein A m is one of k clusters after the data set P is clustered; the data clustering module is used for randomly selecting one data point as an initial clustering center when clustering is carried out, and selecting other initial clustering centers by considering the condition that other data points are constrained by ML/DCL to form an initial clustering center set; The method comprises the steps of adopting a minimum sum matching method to distribute data points to clusters where each initial cluster center in the initial cluster center set is located, preferentially processing a DCL set, considering the intersection condition of data in the DCL set and an ML set, if the data points in the DCL set belong to the ML set, determining the data points according to weights, otherwise, directly calculating the data points until the DCL set is distributed, adopting the minimum sum matching method to distribute the data points to the clusters where each initial cluster center in the initial cluster center set is located, sequentially distributing the data points to the clusters where each center in the initial cluster center set is located, preferentially processing the DCL set, processing the first data point in the DCL set, and if the data points in the DCL set belong to the ML set, using the weights of the ML set as # Centroid of I If the data points in the DCL set do not belong to the ML set, directly calculating the data points, calculating the initial cluster centers corresponding to the processed data points by a minimum sum matching method, enabling the total square distance sum to be minimum, and respectively distributing the data points to clusters where the corresponding initial cluster centers are located; and processing the rest data points until the processing is complete and an initial cluster set is obtained, updating a cluster center by adopting a mean value for each cluster, acquiring a new cluster center set, and iteratively updating the cluster set by utilizing the new cluster center set until the iteration is stopped when smaller cost cannot be acquired.
7. A non-transitory computer readable storage medium storing computer instructions which, when executed by a processor, implement the limited k-means based social network data analysis method of any of claims 1-5.
8. An electronic device comprising a processor, a memory, and a computer program, wherein the processor is coupled to the memory, the computer program being stored in the memory, the processor executing the computer program stored in the memory when the electronic device is operating to cause the electronic device to perform a social network data analysis method based on a limited k-means as claimed in any one of claims 1-5.

Description

Social network data analysis method and system based on limited k-means Technical Field The disclosure relates to the technical field of social network data processing, in particular to a social network data analysis method and system based on a limited k-means value. Background The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art. With the rapid development of information technology today, social networks store and accumulate a large amount of data, and the use of data mining tools to efficiently mine and analyze such data can allow people to obtain more valuable information from such data and a large amount of knowledge about the real world. These data are very important information supports for some relevant departments. The method also greatly promotes resource sharing among various departments and various industries, and continuously promotes social development for release, sharing and analysis of a large amount of data, so that a way of people to acquire information is more convenient. Large-scale datasets typically contain many personal data that needs to be protected, including node data, edge data (connections between individuals), and graph structure data. The cluster analysis algorithm based on the social network data analysis is required, the data analysis can be more accurately carried out on three types of data in the social network data, and a better effect is achieved in a data release preprocessing stage. The data analysis refers to analyzing a large amount of collected data by using a proper statistical analysis method, and concentrating and refining information hidden in a large amount of disordered data, so as to find out the internal rules of the researched objects, so as to develop the data maximally and play a role of data. Cluster analysis is a typical method of data analysis. The purpose of cluster analysis is to analyze whether the data belongs to individual clusters, such that members of one group are similar to each other and different from members of the other group. It analyzes a data set, the classified classification is unknown, and therefore, cluster analysis belongs to unsupervised learning. The current clustering problems include K-means, K-medians and the like, wherein the main methods are a K-means (K-means) algorithm and a K-center (K-means) algorithm, and the K-means algorithm proposed by Stuart Lloyd in 1957 is the most well known and widely used clustering algorithm at present. The classical k-means algorithm needs to randomly select k points in the data set as clustering centers in the initial stage, and the clustering effect and the running time of the k-means algorithm are greatly influenced by the selection of the initial clustering centers, if the selected initial clustering centers are not good, the obtained clustering result may be only a local optimal solution. The existing K-means++ algorithm improves the selection of the initial cluster centers, and the basic idea is that the mutual distance between the initial cluster centers is as far as possible, so that a certain probability is adopted to select the initial cluster centers. However, in actual data samples, there is tag information from the samples, and all samples are constrained. For example, users with similar features are assigned to the same class for data analysis to obtain valuable business information. However, even users with similar characteristics have a limited relationship, for example, on a public life sharing platform, two users pay attention to each other, which means that the users have very similar interests, so that the users can be clustered in the same cluster during data analysis, the users in the cluster can be pushed with the same interested content in the later stage, and two users can add each other into a blacklist, so that when the two users perform data analysis, in order to obtain higher efficiency, the users can not be considered to be clustered in the same cluster. In addition, when processing the data with the marking information, the performance and the precision of the clustering algorithm are required to be considered on the basis of meeting the constraint. Disclosure of Invention In order to solve the problems, the method and the system for analyzing the social network data based on the limited k-means are provided, two constraint types are introduced, the constrained k-means algorithm is enabled to be feasible by carrying out the constrained connection and the disjoint non-connected constraint in an initialization center selection stage, constraint points are classified and processed in an allocation step of an algorithm iteration stage, the problem that the prior art scheme is inaccurate in data processing under the constrained connection and the disjoint non-connected constraint in a clustering process is solved, and the method and the system have higher practicability. A