CN-119889460-B - Gene expression profile classification method and system based on feature dependence and multi-target particle swarm optimization feature selection


Abstract

The invention provides a gene expression profile classification method and system based on feature dependence and multi-objective particle swarm optimization (PSO) feature selection. The method comprises the following steps: calculating a dependency score for each feature of the gene expression profile data, quantifying the dependency relationships between features through information entropy and mutual information; initializing the particle swarm based on the dependency scores, preferentially selecting features with high dependency scores; performing feature selection with a multi-objective particle swarm optimization algorithm that uses the classification error rate and the feature selection rate as objective functions, so that the optimization balances a low error rate against a small feature subset; adopting an improved update strategy for each particle's best position, in which new solutions are generated through a non-dominated-solution update mechanism based on cubic spline interpolation, improving the global search capability of the swarm; and obtaining the optimal feature subset through the multi-objective optimization process. By optimizing the feature selection process and improving the particle swarm optimization strategy, the invention improves classification performance and feature selection efficiency and can effectively capture the dependency relationships among genes.
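
A minimal Python sketch of how information entropy and mutual information can quantify feature dependence. The patent's exact dependency score formula is not reproduced in this text, so the score below (each feature's summed mutual information with all other features, divided by the maximum) is an assumption for illustration, as are the function names and the toy data. Expression values are assumed to be already discretized.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a discrete sequence."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), clamped at 0 against rounding error."""
    joint = list(zip(x, y))
    return max(0.0, entropy(x) + entropy(y) - entropy(joint))


def dependency_scores(columns):
    """ASSUMED score: summed mutual information with every other feature,
    scaled into [0, 1] by the largest raw score."""
    raw = [
        sum(mutual_information(cj, ck) for k, ck in enumerate(columns) if k != j)
        for j, cj in enumerate(columns)
    ]
    top = max(raw) or 1.0  # avoid dividing by zero when all scores are 0
    return [r / top for r in raw]


# Toy discretized expression levels: three genes measured on six samples.
genes = [
    [0, 0, 1, 1, 2, 2],
    [0, 0, 1, 1, 2, 2],  # identical to gene 0, so strongly dependent on it
    [0, 1, 0, 1, 0, 1],  # statistically independent of genes 0 and 1 here
]
scores = dependency_scores(genes)
print(scores)
```

In this toy data the two identical genes receive the highest normalized score and the independent gene receives a score of approximately zero, which is the kind of ranking a dependence-based initialization could then exploit.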

Inventors

HAN FEI
WANG MENGRU

Assignees

Jiangsu University (江苏大学)

Dates

Publication Date: 20260512
Application Date: 20250217

Claims (10)

1. A gene expression profile classification method based on feature dependence and multi-objective particle swarm optimization feature selection, characterized by comprising the following steps: Step S1, downloading gene expression profile data, calculating the dependency score of each feature, and quantifying the dependency relationships among the features using information entropy and mutual information; Step S2, initializing the first-generation particle swarm with a feature-dependence-based population initialization strategy according to the dependency scores obtained in step S1, giving priority to features with high dependency scores when generating the initial particle positions; Step S3, according to the initial particle positions generated in step S2, optimizing the feature selection problem with a multi-objective particle swarm optimization algorithm, combining the classification error rate and the feature selection rate as objective functions, and calculating fitness values with the objective functions; Step S4, selecting the individual best and global best of each population by calculating and comparing the fitness values generated in step S3, generating new non-dominated solutions with a cubic-spline-interpolation-based particle best-position update strategy when updating the individual best, improving the distribution uniformity of the solutions, and generating a Pareto front; Step S5, searching through a multi-objective optimization process and selecting an optimal feature subset on the Pareto front generated in step S4; Step S6, inputting the optimal feature subset selected in step S5 into a plurality of classifiers for training and testing to verify the effectiveness of the method; Step S7, if the termination condition is met, proceeding to step S8, otherwise returning to step S4; and Step S8, ending.
2. The method for classifying gene expression profiles based on feature dependence and multi-objective particle swarm optimization feature selection according to claim 1, wherein step S1 comprises: calculating the univariate information content of each feature using information entropy, and calculating the mutual information between each feature and each of the remaining features; quantifying the dependency of the features with a dependency score formula [formula not reproduced in the source text; its symbols denote the j-th and k-th features and a quantity relating those two features]; and Step S1.3, normalizing the dependency scores and using the normalized scores as the weighting basis in the subsequent population initialization.
3. The method for classifying gene expression profiles based on feature dependence and multi-objective particle swarm optimization feature selection according to claim 2, wherein step S1.3 further comprises: calculating the selection probability of each feature for use in the subsequent initialization, by a formula [not reproduced in the source text] expressed in terms of the dependency score of the j-th feature, with parameters λ = 0.8 and σ = 10.
4. The method for classifying gene expression profiles based on feature dependence and multi-objective particle swarm optimization feature selection according to claim 1, wherein step S2 comprises: Step S2.1, initializing the parameters of the particle swarm optimization (PSO) algorithm, including the number of iterations T, the inertia weight ω, the random variables r1, r2 ∈ [0,1], the upper bound of the particle velocity, the self-learning factor c1, and the global learning factor c2; Step S2.2, generating the initial particle swarm based on the dependency scores and a transfer function, adopting a dependency-score-based probabilistic selection strategy in which each particle represents a feature subset, namely generating n populations and randomly initializing the positions, velocities, and fitness values of the particles.
5. The method for classifying gene expression profiles based on feature dependence and multi-objective particle swarm optimization feature selection according to claim 4, wherein the position vector of each particle in step S2.2 represents a feature subset, and the value of each dimension is 0 or 1, where 1 indicates that the feature is selected and 0 indicates that it is not selected; the initialization formula [not reproduced in the source text] is expressed in terms of the position vector of the particle, the j-th feature of the i-th particle, and the selection probability of the j-th feature.
6. The method for classifying gene expression profiles based on feature dependence and multi-objective particle swarm optimization feature selection according to claim 1, wherein step S3 comprises: Step S3.1, setting the first objective function as the feature selection rate, calculated as $f_1(X_i) = \frac{1}{D}\sum_{j=1}^{D} x_{ij}^{t}$, where $D$ is the total number of features, $X_i$ is the i-th particle, and $x_{ij}^{t}$ is the position information of the j-th feature of the i-th particle in the t-th iteration; Step S3.2, setting the second objective function as the classification error rate, calculated as $f_2 = \frac{FP + FN}{TP + TN + FP + FN}$, where $FP$, $FN$, $TP$, and $TN$ denote false positives, false negatives, true positives, and true negatives, respectively.
7. The method for classifying gene expression profiles based on feature dependence and multi-objective particle swarm optimization feature selection according to claim 1, wherein step S4 comprises: Step S4.1, performing smooth interpolation on the non-dominated solution set with a cubic-spline-interpolation-based update strategy to generate new non-dominated solutions and improve the global search performance of the particle swarm, where the interpolation function is $S_i(x) = a_i + b_i(x - x_i) + c_i(x - x_i)^2 + d_i(x - x_i)^3$, in which the coefficients $a_i$, $b_i$, $c_i$, $d_i$ of the cubic polynomial ensure the smoothness of the generated data: $a_i$ controls the value of the interpolation function at the node; $b_i$ is related to the slope of the interpolation function and controls the tilt of the curve; $c_i$ is related to the second derivative of the curve and affects the change in curvature; and $d_i$ determines the influence of the cubic term and controls the smooth transition of the curve between nodes; and, according to the distance from a non-dominated solution to the nearest Pareto solution set, if the generational distance GD(X) from the non-dominated solution to the Pareto front is smaller than that of the nearest Pareto solution, using the cubic spline interpolation function so that the Pareto front is covered uniformly, where the distance GD(X) is calculated by a formula [not reproduced in the source text] in which N is the population size and the per-solution term is the distance between the i-th solution point in the objective space and the nearest solution point on the Pareto front.
8. The method for classifying gene expression profiles based on feature dependence and multi-objective particle swarm optimization feature selection according to claim 1, wherein step S5 comprises: Step S5.1, guiding the particles toward the Pareto-optimal direction by updating their positions and velocities, each particle updating its own velocity $v_{id}^{t+1}$ and position $x_{id}^{t+1}$ as $v_{id}^{t+1} = \omega v_{id}^{t} + c_1 r_1 (pbest_{id} - x_{id}^{t}) + c_2 r_2 (gbest_{id} - x_{id}^{t})$ and $x_{id}^{t+1} = x_{id}^{t} + v_{id}^{t+1}$, where i denotes the current particle, d denotes the d-th dimension of the search space, t denotes the t-th iteration of the evolution process, ω denotes the inertia weight, $v_{id}^{t}$ is the velocity of particle i in the d-th dimension at the t-th iteration, $x_{id}^{t}$ is the position of particle i in the d-th dimension at the t-th iteration, $pbest_{id}$ is the personal historical best position of particle i in the d-th dimension, $gbest_{id}$ is the global historical best position of particle i in the d-th dimension, $c_1$ and $c_2$ are acceleration constants, and $r_1$ and $r_2$ are random variables in [0,1]; Step S5.2, screening and archiving non-dominated solutions using the fitness values of the particles, and generating the final Pareto front by selecting optimal solutions from the archive.
9. The method for classifying gene expression profiles based on feature dependence and multi-objective particle swarm optimization feature selection according to claim 1, wherein step S6 comprises: inputting the optimized feature subset into a classifier for training; evaluating the classification performance of the classifier on a test set to verify the effectiveness and classification accuracy of the selected features; and repeating the experiments of steps S6.1-S6.2 on the gene expression dataset to verify the feasibility of the algorithm, recording the classification accuracy and the number of selected features.
10. A system for implementing the gene expression profile classification method based on feature dependence and multi-objective particle swarm optimization feature selection according to any one of claims 1-9, comprising a feature score calculation module, a particle position initialization module, a fitness value calculation module, an interpolation calculation module, an optimal feature set selection module, and a method verification module; the feature score calculation module is used for downloading the gene expression profile dataset Colon according to its GSE number, calculating the dependency score of each feature, and quantifying the dependency relationships among the features using information entropy and mutual information; the particle position initialization module is used for initializing the first-generation particle swarm with a feature-dependence-based population initialization strategy according to the dependency scores obtained by the feature score calculation module, giving priority to features with high dependency scores when generating the initial particle positions; the fitness value calculation module is used for optimizing the feature selection problem with a multi-objective particle swarm optimization algorithm according to the initial particle positions generated by the particle position initialization module, combining the classification error rate and the feature selection rate as objective functions, and calculating fitness values with the objective functions; the interpolation calculation module is used for selecting the individual best and global best of each population by calculating and comparing the fitness values generated by the fitness value calculation module, generating new non-dominated solutions with a cubic-spline-interpolation-based particle best-position update strategy when updating the individual best, improving the distribution uniformity of the solutions, and generating a Pareto front; the optimal feature set selection module is used for searching through a multi-objective optimization process and selecting an optimal feature subset on the Pareto front generated by the interpolation calculation module; the method verification module is used for inputting the optimal feature subset selected by the optimal feature set selection module into a plurality of classifiers for training and testing, and verifying the effectiveness of the method.
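
The two objectives named in claim 6 and the non-dominated filtering that yields a Pareto front can be sketched in Python as follows. The formulas are the standard definitions (fraction of selected features, and misclassified samples over all samples) and are assumed here rather than quoted from the claims. The function names and candidate values are hypothetical.

```python
def feature_selection_rate(position):
    """f1: fraction of features selected by a binary position vector."""
    return sum(position) / len(position)


def error_rate(tp, tn, fp, fn):
    """f2: misclassified samples divided by all samples."""
    return (fp + fn) / (tp + tn + fp + fn)


def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better
    in at least one (both objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]


# Hypothetical (feature selection rate, error rate) pairs for five particles.
candidates = [(0.2, 0.30), (0.4, 0.10), (0.5, 0.10), (0.6, 0.05), (0.3, 0.35)]
front = non_dominated(candidates)
print(front)  # [(0.2, 0.3), (0.4, 0.1), (0.6, 0.05)]
```

Here (0.5, 0.10) is dropped because (0.4, 0.10) reaches the same error rate with fewer features, and (0.3, 0.35) is dropped because (0.2, 0.30) is better in both objectives.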

Description

Gene expression profile classification method and system based on feature dependence and multi-target particle swarm optimization feature selection

Technical Field

The invention belongs to the technical field of bioinformatics, and particularly relates to a gene expression profile classification method based on feature dependence and multi-objective particle swarm optimization feature selection, which is applicable to tasks such as high-dimensional data analysis, machine learning, and gene data classification.

Background

The ability to analyze gene expression data has advanced in recent years. However, the volume of biomedical data has grown dramatically, and performing all gene data analysis manually is time-consuming, labor-intensive, and inefficient. With the rapid development of computer technology and machine learning algorithms, gene data classification is becoming an important technology in the life sciences. Various complex and accurate mathematical models have been developed, and computers can now help researchers process large amounts of complex gene expression data, providing support for disease diagnosis and personalized medicine.

In bioinformatics, the analysis of gene expression profile data is one of the most important tasks. With the advent of the big data era, gene expression profile data has become an important resource for studying disease mechanisms and gene functions, and researchers have placed higher demands on the accuracy of gene expression data classification. However, because gene data are typically "high-dimensional, small-sample", conventional classification algorithms struggle to handle redundant features and noise in high-dimensional data efficiently, resulting in poor classification performance.
Therefore, developing an efficient feature selection method is of practical significance for research on gene data classification. Analysis of gene expression data is a typical high-dimensional data processing problem. Redundant features and noise not only increase computational complexity but also reduce the accuracy and generalization ability of the classification model. How to select key features efficiently while reducing computational overhead is therefore a core problem in gene expression data classification.

Disclosure of Invention

In view of the above technical problems, the invention provides a gene expression profile classification method and system based on feature dependence and multi-objective particle swarm optimization feature selection, which aim to optimize feature selection so as to improve classification accuracy and capture the dependency relationships among genes. Note that the description of these objects does not preclude the existence of other objects. Not all of the above objects need to be achieved in one embodiment of the invention. Objects other than those above can be derived from the description, drawings, and claims.

In feature selection and classification problems, conventional methods often give insufficient consideration to the dependency relationships among features, so that redundant features degrade the optimization result; at the same time, it is difficult to balance the size of the selected feature subset against classification performance during optimization. To solve these problems, the invention provides a feature-dependence-based multi-objective particle swarm optimization feature selection method, which uses feature dependence to optimize population initialization and combines it with an improved particle position update strategy to significantly improve the efficiency of feature selection and the classification performance.
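
The dependence-biased population initialization described above can be sketched in Python as follows, assuming each feature already has a selection probability derived from its dependency score (the mapping from score to probability is not reproduced in this text). The function names, the example probabilities, and the use of one independent random draw per feature are illustrative assumptions.

```python
import random


def init_particle(select_prob, rng):
    """Build one binary position vector: feature j is selected (1) with
    probability select_prob[j], otherwise left out (0)."""
    return [1 if rng.random() < p else 0 for p in select_prob]


def init_swarm(select_prob, n_particles, seed=0):
    """Build the first-generation swarm; each particle is a feature subset."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [init_particle(select_prob, rng) for _ in range(n_particles)]


# Hypothetical selection probabilities for five genes: features with high
# dependency scores are assumed to have been given high probabilities.
probabilities = [0.9, 0.8, 0.5, 0.2, 0.1]
swarm = init_swarm(probabilities, n_particles=4)
for particle in swarm:
    print(particle)
```

Features with a high selection probability appear in most initial particles, so the search starts in a region biased toward strongly dependent features while still leaving room for low-probability features to be explored.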
Particle swarm optimization (PSO) is a heuristic search algorithm that finds the optimal solution in the solution space through the cooperation of individuals and the population. However, conventional PSO algorithms such as MOEA/D-PSO suffer from premature convergence and unevenly distributed solutions when dealing with high-dimensional data and complex optimization problems, which limits their optimization performance. To address these problems, the invention adopts a multi-objective particle swarm optimization algorithm, combines a feature-dependence-based population initialization strategy with a cubic-spline-interpolation-based update mechanism for each particle's individual best position, and achieves global optimization of the feature selection problem by jointly optimizing the classification error rate and the feature selection rate. The gene expression profile data classification method based on feature dependence and multi-target particle swarm optimization feature selection is used for optimizing feature selection to improve classification accuracy and capturing depen