CN-121997026-A - Class unbalance analysis method based on data life cycle evolution


Abstract

The invention provides a class unbalance analysis method based on data life cycle evolution, comprising: step 1, constructing a full life cycle evolution model of the data samples; step 2, carrying out panoramic scanning and liveness statistical analysis on the dynamic indexes of structural response characteristics at different evolution levels; step 3, monitoring in real time the interaction behaviors among samples of different classes as the full life cycle evolution model advances; step 4, constructing a quantity balance mechanism based on life cycle evolution abundance; and step 5, training a classification model with a weighted objective function and applying it to a test set to classify class-unbalanced data. The optimal fusion coefficients are determined automatically by a Bayesian optimization method that uses Gaussian process regression as the proxy model, which avoids the blindness of manual parameter tuning and improves both the recognition rate of minority class samples and the overall classification performance.

Inventors

DAI WEI
ZHANG SHIJIE
NAN JING
ZHANG WEI
FAN RUIPENG
LIU XIN
ZHANG XIANGRUI

Assignees

China University of Mining and Technology (中国矿业大学)

Dates

Publication Date: 20260508
Application Date: 20260128

Claims (10)

1. A class unbalance analysis method based on data life cycle evolution, characterized by comprising the following steps:
Step 1, constructing a full life cycle evolution model of the data samples, simulating the dynamic process by which samples evolve from a discrete, independent state into a complex structural form, and extracting dynamic indexes that represent the structural response characteristics of the samples at different evolution levels, wherein the dynamic indexes comprise a life cycle merging assimilation scale, which describes a sample losing its individual independence and merging into a local manifold; a life cycle association closure scale, which describes the interaction network of a sample's neighborhood forming a stable closed path; and a life cycle multidimensional configuration derivative scale, which describes a sample participating in the construction of a high-dimensional spatial topological framework and envelope structure;
Step 2, carrying out panoramic scanning and liveness statistical analysis on the dynamic indexes of structural response characteristics at different evolution levels, identifying abnormal behaviors based on how long a sample persists in the evolution process and on its structural contribution, and constructing a life cycle outlier suppression mechanism that identifies, and reduces the weight contribution during classification model training of, transient noise samples that are rapidly eliminated in the early stage of evolution and delayed free samples that join structure construction only in the late stage of evolution;
Step 3, monitoring in real time the interaction behaviors among samples of different classes as the full life cycle evolution model advances, identifying early interaction structures comprising direct cross-class connections and high-order mixed entanglement, and constructing a life cycle class overlap suppression mechanism by calculating the evolution stage position at which a sample first triggers cross-class contact and its entanglement depth, so as to suppress the negative influence on the classification decision surface of boundary-fuzzy samples that show atypical cross-class affinity in early evolution;
Step 4, constructing a quantity balance mechanism based on life cycle evolution abundance, fusing it with the life cycle outlier suppression mechanism and the life cycle class overlap suppression mechanism as a convex combination, converting the search for the optimal fusion coefficients into a hyperparameter optimization problem, and optimizing automatically with a Bayesian optimization algorithm that uses Gaussian process regression as the proxy model and the expected improvement criterion, so as to determine a life cycle comprehensive weight vector that maximizes system evolution stability;
Step 5, constructing the life cycle comprehensive weight vector into a sample-level diagonal weight matrix, embedding that matrix into the loss function and the hidden layer node generation constraints of a random configuration network, training a classification model with the weighted objective function, applying the classification model to a test set to classify class-unbalanced data, and outputting performance indexes including the geometric mean (G-mean), recall (Sensitivity), F1-score, and precision (Precision).
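The four performance indexes named in step 5 have standard definitions for a binary task. The following is a minimal sketch of those definitions, assuming the minority class is the positive class; the function name `imbalance_metrics` and the toy labels are illustrative and do not come from the patent.

```python
import math


def imbalance_metrics(y_true, y_pred, positive=1):
    """Return Sensitivity, Precision, F1-score and G-mean for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    def ratio(num, den):
        # Guard against empty denominators (e.g. no predicted positives).
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)   # recall on the minority class
    specificity = ratio(tn, tn + fp)   # recall on the majority class
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    g_mean = math.sqrt(sensitivity * specificity)
    return {"sensitivity": sensitivity, "precision": precision,
            "f1": f1, "g_mean": g_mean}


# Toy example: two minority samples (label 1), four majority samples (label 0).
metrics = imbalance_metrics([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1])
```

G-mean is the geometric mean of the recall on each class, so it falls to zero whenever either class is missed entirely, which is why it is commonly preferred over plain accuracy on unbalanced data.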
2. The method of claim 1, wherein step 1 comprises:
Step 1-1, defining an unbalanced data set in which each sample consists of a feature vector in real space and a class label, and calculating the heterogeneous interaction intensity matrix among the samples, each matrix element representing the heterogeneous interaction intensity between a pair of samples; based on the heterogeneous interaction intensity matrix, constructing a full life cycle evolution model that advances dynamically with the evolution time sequence, wherein an active connection path is established between two samples of the unbalanced data set, forming a local symbiotic structural unit, if and only if the heterogeneous interaction intensity between them is less than or equal to the evolution time sequence threshold;
Step 1-2, extracting the life cycle merging assimilation scale: establishing a community evolution simulation mechanism that simulates the process of independent samples merging into a local manifold; sorting all potential connections in the heterogeneous interaction intensity matrix from small to large by structural affinity; traversing the sorted connections in turn and, for each connection, performing a community assimilation operation if its two samples belong to different independent evolution communities; and defining the life cycle merging assimilation scale of the samples as follows: [formula not reproduced], where the two sample indexes are those of the two samples joined by a potential connection in the heterogeneous interaction intensity matrix, and the scale value is the critical interaction intensity that causes the structural fusion of the independent communities;
Step 1-3, extracting the life cycle association closure scale: defining the interaction neighborhood set of a sample under a given evolution time sequence; when the third-order trace of the association matrix shows that end-to-end circulation paths exist in the full life cycle evolution model, a stable closure association is regarded as having formed; the life cycle association closure scale of a sample is approximated by the average minimum interaction intensity at which closed loops form within its interaction neighborhood: [formula not reproduced], where the normalization uses the cardinality of the interaction neighborhood set;
Step 1-4, extracting the life cycle multidimensional configuration derivative scale of each sample by a neighborhood Voronoi approximation method: when the number of neighborhood samples of a sample satisfies the support condition, the high-dimensional configuration derivative scale is calculated as: [formula not reproduced], where the formula uses a preset configuration amplification coefficient; [case definition not reproduced].
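The sorted traversal with community assimilation in step 1-2 resembles a Kruskal-style union-find sweep. The sketch below shows one possible reading under explicit assumptions: Euclidean distance stands in for the heterogeneous interaction intensity, and a sample receives as its scale the intensity of the first fusing connection it takes part in. Neither choice is confirmed by the source, whose formula is missing, and the name `merge_scales` is hypothetical.

```python
import math
from itertools import combinations


def merge_scales(points):
    """Sweep connections from small to large and record, per sample, the
    intensity at which a connection touching it first fuses two communities."""
    n = len(points)
    parent = list(range(n))  # each sample starts as its own community

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Assumption: Euclidean distance plays the role of interaction intensity.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    scale = [None] * n
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:            # the two samples sit in different communities
            parent[ri] = rj     # community assimilation
            for k in (i, j):
                if scale[k] is None:
                    scale[k] = w
    return scale


pts = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
scales = merge_scales(pts)
```

In this toy run the two nearby points fuse at intensity 1.0, while the distant point joins only at 4.0, so a large scale flags a sample that stays isolated for most of the sweep.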
3. The method according to claim 2, wherein step 2 comprises:
Step 2-1, determining the full life cycle evolution scanning interval: determining the observation interval of the evolution time sequence based on the average heterogeneous interaction intensity of the minority class samples and the global mean of the life cycle merging assimilation scale, and generating within the observation interval a multi-stage scan sequence of evolution time sequence nodes, the interval being bounded by the starting and terminating heterogeneous interaction intensity thresholds of the life cycle evolution scan, and each node being the heterogeneous interaction intensity threshold of one life cycle evolution stage; the minority class is the category with the smaller number of samples in the current training data and the majority class is the category with the larger number of samples, so that the number of minority class samples is smaller than the number of majority class samples;
Step 2-2, calculating the cross-stage average liveness weight: observing each sample under the evolution time sequence and using an exponential mapping function to calculate the sample's cross-stage average weight in each evolution dimension: [formula not reproduced], where the three evolution dimensions correspond to merging assimilation, association closure and configuration derivation; the average is taken over the set of time indexes at which the sample exhibits an effective evolution response; and the formula also uses the global mean of the dynamic index in the given dimension and a smoothing factor;
Step 2-3, introducing a category correction coefficient to adjust the cross-stage average liveness weight, giving the corrected life cycle weight, the correction coefficient of minority class samples being set larger than that of majority class samples;
Step 2-4, constructing an evolution hysteresis suppression logic: judging hysteresis behavior from the life cycle response time sequence of each sample in the multi-stage evolution scan sequence, and keeping the weights obtained in steps 2-2 to 2-3 unchanged if the sample does not meet the hysteresis judgment condition;
Step 2-5, constructing an adaptive cooperative coupling model of the multidimensional life cycle signals: regarding the dynamic weights of the three dimensions of merging assimilation, association closure and configuration derivation as independent evolution signal channels, and defining the life cycle outlier suppression weight of a single sample as the result of cooperative modulation of the multichannel signal: [formula not reproduced], where the weight coefficients are searched under a simplex constraint by particle swarm optimization, with validation indexes including the geometric mean (G-mean), recall (Sensitivity), F1-score and precision (Precision) as the objective function, and with particles updated by the three terms of inertia, individual and swarm;
Step 2-6, globally normalizing the life cycle outlier suppression weights of all samples and constructing a sample-level diagonal life cycle outlier suppression matrix, where diag denotes a diagonal matrix whose entries are the outlier suppression weights of the individual samples.
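Step 2-5 names particle swarm optimization under a simplex constraint with inertia, individual and swarm terms. The sketch below illustrates that update rule only. The clip-and-renormalize repair step, the hyperparameter values, and the toy objective (a stand-in for a validation score) are all assumptions for illustration, not details from the patent.

```python
import random


def to_simplex(v):
    """Simple repair step (assumed): clip negatives, renormalise to sum 1."""
    clipped = [max(x, 0.0) for x in v]
    total = sum(clipped)
    if total == 0.0:
        return [1.0 / len(v)] * len(v)
    return [x / total for x in clipped]


def pso_simplex(objective, dim=3, particles=12, iters=40, seed=0,
                inertia=0.7, c_ind=1.4, c_swarm=1.4):
    """Maximise `objective` over coefficient vectors kept on the simplex."""
    rng = random.Random(seed)
    pos = [to_simplex([rng.random() for _ in range(dim)])
           for _ in range(particles)]
    vel = [[0.0] * dim for _ in range(particles)]
    best_pos = [p[:] for p in pos]
    best_val = [objective(p) for p in pos]
    g = max(range(particles), key=lambda k: best_val[k])
    g_pos, g_val = best_pos[g][:], best_val[g]
    for _ in range(iters):
        for k in range(particles):
            for d in range(dim):
                vel[k][d] = (
                    inertia * vel[k][d]                                     # inertia term
                    + c_ind * rng.random() * (best_pos[k][d] - pos[k][d])   # individual term
                    + c_swarm * rng.random() * (g_pos[d] - pos[k][d])       # swarm term
                )
            pos[k] = to_simplex([pos[k][d] + vel[k][d] for d in range(dim)])
            val = objective(pos[k])
            if val > best_val[k]:
                best_pos[k], best_val[k] = pos[k][:], val
                if val > g_val:
                    g_pos, g_val = pos[k][:], val
    return g_pos, g_val


TARGET = (0.5, 0.3, 0.2)  # hypothetical optimum of the toy score


def toy_score(a):
    """Toy stand-in for a validation index; peaks at TARGET."""
    return -sum((x - t) ** 2 for x, t in zip(a, TARGET))


alpha, score = pso_simplex(toy_score)
```

In practice the objective would train and validate a classifier for each coefficient vector, which is far more expensive than this toy function, so the particle count and iteration budget would need to be chosen accordingly.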
4. The method according to claim 3, wherein step 3 comprises:
Step 3-1, monitoring cross-class interaction structures in real time during evolution: as the full life cycle evolution model advances, when a pair of samples from different classes satisfies the heterogeneous interaction intensity condition at a given evolution time sequence, a direct cross-class connection path is determined to be established and the pair is recorded in a set: [formula not reproduced]; if three sample points establish active connection paths pairwise at a given evolution time sequence and their class labels are not all identical, a high-order mixed entanglement structure is determined to be formed and is recorded in a set: [formula not reproduced];
Step 3-2, evaluating the life cycle entanglement start index and the basic purity confidence weight of each sample: tracking the critical evolution time sequence at which a sample first participates in constructing a cross-class interaction structure, defining it as the entanglement start index, and constructing the basic purity confidence weight with a Sigmoid mapping function having displacement and scaling factors: [formula not reproduced], where the formula uses an entanglement boundary threshold and exp denotes the natural exponential function;
Step 3-3, calculating the inter-class collision intensity and the density suppression factor;
Step 3-4, generating the life cycle class overlap suppression mechanism.
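Steps 3-1 and 3-2 can be illustrated with a small threshold sweep. This sketch assumes a precomputed distance matrix as the interaction intensity, uses the scan stage number as the entanglement start index, and assumes a logistic form for the Sigmoid mapping in which earlier entanglement gives a lower weight. The source formula is missing, so the exact form, the treatment of never-entangled samples, and all function names are hypothetical.

```python
import math
from itertools import combinations


def cross_class_structures(dist, labels, eps):
    """Direct cross-class edges and mixed-label triangles active at threshold eps."""
    n = len(labels)

    def linked(a, b):
        return dist[a][b] <= eps

    edges = [(i, j) for i, j in combinations(range(n), 2)
             if labels[i] != labels[j] and linked(i, j)]
    triangles = [(i, j, k) for i, j, k in combinations(range(n), 3)
                 if len({labels[i], labels[j], labels[k]}) > 1
                 and linked(i, j) and linked(j, k) and linked(i, k)]
    return edges, triangles


def entanglement_start(dist, labels, thresholds):
    """Index of the first scan stage at which each sample joins a cross-class
    structure (None if it never does)."""
    start = [None] * len(labels)
    for stage, eps in enumerate(thresholds):
        edges, triangles = cross_class_structures(dist, labels, eps)
        for group in edges + triangles:
            for s in group:
                if start[s] is None:
                    start[s] = stage
    return start


def purity_weight(start, boundary, scale=1.0):
    """Assumed Sigmoid with shift `boundary` and scaling `scale`."""
    if start is None:
        return 1.0  # assumption: never-entangled samples keep full confidence
    return 1.0 / (1.0 + math.exp(-scale * (start - boundary)))


# 1-D toy data: samples 0 and 1 are class 0, samples 2 and 3 are class 1.
xs = [0.0, 1.0, 1.5, 4.0]
labels = [0, 0, 1, 1]
dist = [[abs(a - b) for b in xs] for a in xs]
starts = entanglement_start(dist, labels, thresholds=[0.4, 0.6, 1.6, 3.0])
weights = [purity_weight(s, boundary=2.0) for s in starts]
```

Samples 1 and 2 sit closest to the class boundary, touch the other class first, and therefore receive the lowest purity weights in this toy run.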
5. The method of claim 4, wherein step 3-3 comprises computing the cross-class connectivity of a sample at a specified evolution time sequence: [formula not reproduced], where the formula uses the elements of the adjacency matrix of the evolution model and an indicator function; and constructing a density suppression factor based on the cross-class connectivity.
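One plausible reading of the cross-class connectivity is a sum of adjacency entries over neighbors whose label differs from the sample's own, which is what the adjacency matrix and indicator function in the claim suggest. The sketch below follows that reading. The reciprocal form of the density suppression factor and the parameter `gamma` are assumptions, since the source does not give the formula.

```python
def cross_class_connectivity(adj, labels):
    """Count, per sample, the active connections to samples of another class."""
    n = len(labels)
    return [
        sum(adj[i][j] for j in range(n) if j != i and labels[i] != labels[j])
        for i in range(n)
    ]


def density_suppression(conn, gamma=1.0):
    """Assumed form: the factor shrinks as cross-class connectivity grows."""
    return [1.0 / (1.0 + gamma * c) for c in conn]


# Toy adjacency: samples 0, 1, 2 are mutually connected; sample 3 is isolated.
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]
labels = [0, 0, 1, 1]
conn = cross_class_connectivity(adj, labels)
factors = density_suppression(conn)
```

Sample 2 is connected to two samples of the other class and gets the strongest suppression, while the isolated sample 3 keeps a factor of 1.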
6. The method of claim 5, wherein step 3-4 comprises coupling the basic purity confidence weight with the density suppression factor to generate the final life cycle class overlap suppression weight, globally normalizing that weight, and constructing a sample-level diagonal life cycle class overlap suppression matrix whose diagonal entries are the life cycle class overlap suppression weights of the individual samples.
7. The method of claim 6, wherein step 4 comprises:
Step 4-1, constructing a sample quantity balance mechanism based on life cycle evolution abundance: first, calculating the imbalance ratio of majority class to minority class samples, taking its square root, and applying an upper limit threshold to obtain a smoothed basic quantity factor; then, for the majority class and the minority class separately, calculating the ratio of the number of effective evolution responses to the total number of samples in each of the three life cycle evolution dimensions (merging assimilation, association closure and configuration derivation), and taking the arithmetic mean over the three dimensions to obtain the comprehensive evolution abundance ratio of each class; finally, calculating the ratio of the comprehensive evolution abundance ratio of the majority class to that of the minority class, defining it as the evolution adjustment coefficient, multiplying it by the basic quantity factor to obtain the final weighting multiplier of the minority class samples, and constructing a diagonal quantity weight matrix;
Step 4-2, constructing a convex combination fusion model of the multidimensional constraint weights: fusing the diagonal quantity weight matrix, the sample-level diagonal life cycle outlier suppression matrix and the sample-level diagonal life cycle class overlap suppression matrix as a convex combination to obtain the comprehensive weight vector W: [formula not reproduced], where the three fusion coefficients are non-negative and sum to 1;
Step 4-3, establishing a fusion parameter proxy model based on Gaussian process regression: converting the search for the fusion coefficients that adjust the sample quantity weight, the life cycle outlier suppression weight and the life cycle class overlap suppression weight into a hyperparameter optimization problem oriented to unbalanced classification performance, and introducing a Bayesian optimization method with Gaussian process regression as the proxy model to achieve adaptive fusion of the multi-source life cycle weights;
Step 4-4, adopting the expected improvement criterion as the sampling strategy in the Bayesian optimization process, taking classification performance indexes including G-mean, Sensitivity, F1-score and Precision as the objective function, iteratively evaluating and updating the proxy model, and finally obtaining the optimal fusion coefficient combination, thereby determining the comprehensive weight vector; the optimal fusion coefficients respectively represent the optimal sample quantity weight, the optimal life cycle outlier suppression weight and the optimal life cycle class overlap suppression weight.
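The arithmetic of step 4-1, the convex fusion of step 4-2 and the expected improvement criterion of step 4-4 can be sketched as below. Several details are assumptions rather than statements from the source: the per-class sample count is used as the "total number of samples" in the abundance ratio, the cap value of 10 is arbitrary, and the closed-form expected improvement is the standard one for maximization under a Gaussian posterior. All function names are hypothetical.

```python
import math


def minority_multiplier(n_major, n_minor, resp_major, resp_minor, cap=10.0):
    """Square-rooted, capped imbalance ratio times the abundance adjustment."""
    base = min(math.sqrt(n_major / n_minor), cap)
    # Assumption: response counts are divided by the class's own sample count.
    abundance_major = sum(r / n_major for r in resp_major) / len(resp_major)
    abundance_minor = sum(r / n_minor for r in resp_minor) / len(resp_minor)
    return base * abundance_major / abundance_minor


def fuse(w_num, w_out, w_ovl, coeffs):
    """Convex combination of the three per-sample weight vectors."""
    if min(coeffs) < 0.0 or abs(sum(coeffs) - 1.0) > 1e-9:
        raise ValueError("fusion coefficients must be >= 0 and sum to 1")
    a, b, c = coeffs
    return [a * x + b * y + c * z for x, y, z in zip(w_num, w_out, w_ovl)]


def expected_improvement(mu, sigma, best):
    """Standard EI for maximisation, given posterior mean mu and std sigma."""
    if sigma <= 0.0:
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best) * cdf + sigma * pdf


# 90 majority vs 10 minority samples; response counts per evolution dimension.
mult = minority_multiplier(90, 10, resp_major=[81, 72, 63], resp_minor=[8, 6, 4])
fused = fuse([mult, 1.0], [0.5, 1.0], [0.25, 1.0], coeffs=(0.5, 0.25, 0.25))
ei = expected_improvement(mu=0.8, sigma=1.0, best=0.8)
```

A full Bayesian optimization loop would fit a Gaussian process to the scores of the fusion coefficients evaluated so far and pick the next candidate by maximizing `expected_improvement` over the simplex; that loop is omitted here.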
8. The method of claim 7, wherein step 5 comprises:
Step 5-1, initializing the parameter system of the evolution-constrained random configuration network: setting the maximum hidden layer node capacity of the network, the maximum number of candidate searches per node generation, the expected tolerance residual of the system, the random weight generation interval and the shrinkage factor, wherein the bound of the random weight generation interval is the upper bound on the amplitude of the random weights of hidden nodes in the random configuration network;
Step 5-2, during the iterative addition of nodes, randomly generating the input weights and biases of a group of candidate hidden layer nodes from the random weight generation interval, and calculating the output vectors of the candidate hidden layer nodes;
Step 5-3, performing node screening based on the supervisory constraint: calculating the effectiveness index of each candidate node according to the supervisory inequality constraint criterion, screening out the effective evolution hidden layer nodes that reduce the current residual and satisfy the linear independence constraint, adding them to the network structure, and synchronously updating the hidden layer output matrix;
Step 5-4, embedding the life cycle comprehensive weight matrix in the output layer solution: constructing the comprehensive weight vector into a sample-level diagonal weight matrix and solving the output layer weights by the weighted least squares method: [formula not reproduced], where the formula uses the label matrix of the training set, the generalized inverse of a matrix, and the transpose;
Step 5-5, evaluating the evolution convergence state of the model: calculating the training residual of the current model and judging whether it is smaller than the expected tolerance residual or whether the current number of hidden layer nodes has reached the capacity upper limit; if either stopping condition is met, the model construction is judged complete; otherwise, the current residual is updated and the method returns to step 5-2 to continue the incremental addition of nodes;
Step 5-6, constructing the final classifier from the trained model parameters, carrying out classification prediction on the test set, and calculating the classification performance indexes.
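The weighted least squares solve of step 5-4 can be sketched without external libraries for a single output column. This version solves the weighted normal equations by Gaussian elimination and assumes they are non-singular, whereas the source refers to a generalized inverse. The function name and the optional `ridge` term are illustrative additions, not part of the patent.

```python
def weighted_least_squares(H, w, t, ridge=0.0):
    """Solve (H^T W H + ridge*I) beta = H^T W t with W = diag(w)."""
    n, m = len(H), len(H[0])
    # Build the weighted normal equations.
    A = [[sum(w[r] * H[r][i] * H[r][j] for r in range(n))
          + (ridge if i == j else 0.0)
          for j in range(m)] for i in range(m)]
    b = [sum(w[r] * H[r][i] * t[r] for r in range(n)) for i in range(m)]
    # Forward elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("singular system; a pseudo-inverse would be needed")
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    beta = [0.0] * m
    for i in reversed(range(m)):
        s = sum(A[i][j] * beta[j] for j in range(i + 1, m))
        beta[i] = (b[i] - s) / A[i][i]
    return beta


# Toy hidden-layer output matrix (bias column plus one node) and targets.
H = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
t = [1.0, 3.0, 5.0]       # exactly 1 + 2 * (second column)
w = [1.0, 4.0, 1.0]       # per-sample weights from the comprehensive vector
beta = weighted_least_squares(H, w, t)
```

Because the toy targets are fitted exactly, any positive weights give the same solution here; with noisy targets, larger weights pull the fit toward the corresponding samples, which is how the sample-level weights favor trusted minority class samples.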
9. An electronic device comprising a processor and a memory, the memory storing program code that, when executed by the processor, causes the processor to perform the steps of the method of any of claims 1 to 8.
10. A storage medium storing a computer program or instructions which, when run on a computer, cause the computer to perform the steps of the method of any one of claims 1 to 8.

Description

Class unbalance analysis method based on data life cycle evolution

Technical Field

The invention belongs to the field of unbalanced data analysis, and particularly relates to a class unbalance analysis method based on data life cycle evolution.

Background

In real-world application scenarios, data imbalance is common. In medical diagnosis, diseased samples are far fewer than healthy samples; in financial risk control, fraudulent transaction records account for only a very small proportion of all transactions; and in network security, malicious intrusion traffic is vastly outnumbered by normal traffic. Because of this imbalance, a traditional classifier tends to be biased toward majority class samples during training, so the recognition rate of minority class samples drops markedly and missed detections occur easily, which degrades the performance and reliability of the whole system.

Researchers have proposed a number of data-level and algorithm-level solutions to the unbalanced data classification problem. Among data-level approaches, undersampling, oversampling, and mixed sampling are common. Undersampling balances the data distribution by removing majority class samples, but easily loses key information; oversampling expands the minority class by copying or synthesizing samples, but often introduces noise or aggravates class overlap; mixed sampling combines the advantages of the two, but may still damage the distribution structure of the original data.
Among algorithm-level methods, ensemble learning, cost-sensitive learning, and strategies based on clustering or feature selection are widely used. These can improve minority class recognition to some extent, but they depend strongly on the data distribution when handling complex data, and they generally rely on complex parameter designs with limited generalization capability.

In deeper data analysis, however, the core problem often cannot be solved by balancing quantities alone, because unbalanced data is usually accompanied by more complex evolutionary anomalies and cross-entanglement. Specifically, the outlier problem is not a simple geometric distance deviation; it is essentially a life cycle abnormality of a sample in the process by which the data structure is generated. Some samples may be transient noise that appears momentarily, or delayed individuals that stray from the mainstream evolution path, and if they are not treated differently during training they seriously disturb the convergence direction of the model. The class overlap problem reflects atypical symbiosis or excessive entanglement of samples from different classes in early evolution, which manifests as heterogeneous samples establishing strong interconnections prematurely. Traditional methods often lack the ability to discriminate dynamic evolution, and find it difficult to tell whether a minority class sample located at the boundary is a high-information core skeleton or entanglement noise that causes classification confusion.
Therefore, an important technical problem to be solved in the field of unbalanced data classification is how to break through the limitation of the static geometric view: how to establish a life cycle evolution model that simulates the whole process of data from isolation to symbiosis and, by capturing the dynamic responses of samples in evolution stages such as merging assimilation, association closure and configuration derivation, to achieve accurate identification of outliers, effective suppression of interaction overlap, and adaptive weighting of key evolution samples.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a class unbalance analysis method based on data life cycle evolution. The method simulates the life cycle process by which samples evolve from independent individuals into a complex structure; extracts merging assimilation, association closure and configuration derivative indexes that characterize the evolution stages; constructs from them a sample-level comprehensive weight matrix that accounts for outlier and overlap suppression; combines Bayesian optimization to fuse the multidimensional evolution weights adaptively; and finally embeds the sample-level comprehensive weight matrix into an evolution-constrained random configuration network, so that unbalanced data can be classified with high precision. The method comprises the following steps:

Step 1, constructing a full life cycle evolution model of a data sample, simulating a dynamic process of evolution of the sample from a discrete independent state to a complex structure form, extracting dynamic ind