CN-121980231-A - Modeling feature screening method, modeling feature screening device, modeling feature screening medium and modeling feature screening product


Abstract

Embodiments of the application provide a modeling feature screening method, device, medium and product, and relate to the technical field of artificial intelligence. The method comprises: receiving sample data input by a user; using the sample data, performing evaluation in V different evaluation modes, and determining, based on the top U chromosomes obtained in each evaluation mode, candidate feature combination sets associated with the evaluation modes and the chromosomes; and ranking the candidate feature combination sets and selecting the combination features in the highest-ranked candidate feature combination set as target features. The scheme addresses the limitations of existing feature selection methods in efficiency, generalization and model suitability.

Inventors

LIN HENG
ZHOU WEI
HUANG CHENGWEI
TU CONGHUAN
LIU JINGYU
LIU YUYANG

Assignees

中移(上海)信息通信科技有限公司
中移智行网络科技有限公司
中国移动通信集团有限公司

Dates

Publication Date: 20260505
Application Date: 20260112

Claims (10)

1. A modeling feature screening method, comprising: receiving sample data input by a user; using the sample data, performing evaluation in V different evaluation modes, and determining, based on the top U chromosomes obtained in each evaluation mode, candidate feature combination sets associated with the evaluation modes and the chromosomes; and ranking the candidate feature combination sets, and selecting the combination feature in the highest-ranked candidate feature combination set as the target feature.
2. The method of claim 1, wherein using the sample data, performing evaluation in V different evaluation modes, and determining, based on the top U chromosomes obtained in each evaluation mode, candidate feature combination sets associated with the evaluation modes and the chromosomes comprises: determining, using the sample data, an evaluation result corresponding to each evaluation mode, wherein the evaluation result represents a fitness calculation rule based on the corresponding evaluation mode; in the evaluation process of each evaluation mode, determining, according to the evaluation result and a preset genetic algorithm, the chromosomes ranked in the top U by fitness in that evaluation mode; and deduplicating the top U chromosomes corresponding to each evaluation mode to obtain the candidate feature combination set associated with the evaluation modes and the chromosomes.
3. The method according to claim 2, wherein, in the case where the evaluation modes include model prediction evaluation, determining an evaluation result corresponding to each evaluation mode using the sample data comprises: preprocessing the sample data to determine a training set and a test set; performing modeling and prediction on the training set and the test set according to a preset prediction model, and outputting prediction results for the training set and the test set respectively; selecting at least one classification or regression performance metric, converting the minimize-type metrics in the prediction results of the training set and the test set into a maximize-type form, and determining a corresponding training set evaluation metric and test set evaluation metric; and determining an evaluation result according to the training set evaluation metric, the test set evaluation metric, and the ratio of the total number of features to the number of currently selected features.
4. The method according to claim 2, wherein, in the case where the evaluation modes include feature vector division evaluation, determining an evaluation result corresponding to each evaluation mode using the sample data comprises: determining, based on the sample data, the total number of classes, the total number of samples, the number of samples in each single class, the overall mean vector, the mean vector of each single class, and the individual vectors in each single class; determining an overall within-class scatter matrix according to the total number of classes, the total number of samples, the mean vector of each single class, and the individual vectors in each single class; determining an overall between-class scatter matrix according to the total number of classes, the total number of samples, the number of samples in each single class, the overall mean vector, and the mean vector of each single class; and determining an evaluation result from the overall within-class scatter matrix and the overall between-class scatter matrix, using either a within-class/between-class feature distance approach or a Fisher criterion approach.
5. The method according to claim 2, wherein, in the case where the evaluation modes include feature vector division evaluation, determining an evaluation result corresponding to each evaluation mode using the sample data further comprises: standardizing the sample data, calculating the difference between the standardized features in the target-class samples and the average value of the standardized features in the non-target-class samples, and determining an importance matrix associating features with classes; calculating an average feature contribution and a high-importance feature contribution according to the importance matrix, wherein the average feature contribution represents the overall predictive capability of the features, and the high-importance feature contribution represents the features whose contribution is higher than the average feature contribution; and determining an evaluation result according to the average feature contribution and the high-importance feature contribution.
6. The method according to claim 2, wherein, in the evaluation process of each evaluation mode, determining, according to the evaluation result and a preset genetic algorithm, the chromosomes ranked in the top U by fitness in that evaluation mode comprises: in the evaluation process of each evaluation mode, randomly generating N chromosomes as an initial population, wherein the gene sequence of each chromosome represents a group of candidate feature combinations, and the size of the initial population and the gene length are set according to the population size and the total number of features to be screened, respectively; executing an iteration process for a preset total number of iteration rounds, wherein in each round the fitness of all chromosomes in the current population is quantified based on the evaluation result of the evaluation mode, and selection, crossover (inheritance) and mutation operations are executed in sequence, the current population being the initial population in the first round; and after the preset total number of iteration rounds, retaining the chromosomes ranked in the top U by fitness among all chromosomes produced over the whole iteration process in that evaluation mode.
7. The method of claim 1, wherein ranking the candidate feature combination sets and selecting the combination feature in the highest-ranked candidate feature combination set as the target feature comprises: for any one of the candidate feature combination sets, taking the square of the chromosome's rank under the corresponding evaluation mode as the score calculation rule for the feature combination, and, according to the score calculation rule, taking the sum of the scores of all combination features in the candidate feature combination set as the total score of that set; and selecting the combination feature in the highest-ranked candidate feature combination set as the target feature.
8. A modeling feature screening apparatus, comprising: a first processing module, configured to receive sample data input by a user; a second processing module, configured to perform evaluation in V different evaluation modes using the sample data, and to determine, based on the top U chromosomes obtained in each evaluation mode, candidate feature combination sets associated with the evaluation modes and the chromosomes; and a third processing module, configured to rank the candidate feature combination sets and select the combination feature in the highest-ranked candidate feature combination set as the target feature.
9. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
10. A computer program product comprising computer instructions which, when executed by a processor, implement the steps of the method of any of claims 1 to 7.
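For illustration only, the scatter-matrix evaluation recited in claim 4 can be sketched in Python as follows. The normalization by the total number of samples, the trace ratio used as a Fisher-style score, and the names `scatter_matrices` and `fisher_fitness` are assumptions of this sketch; the claims do not fix these details.

```python
import numpy as np

def scatter_matrices(X, y):
    """Return (S_w, S_b): within-class and between-class scatter matrices.

    X has shape (n_samples, n_features); y holds one class label per sample.
    Both matrices are normalized by the total sample count (an assumption).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_total, n_features = X.shape
    overall_mean = X.mean(axis=0)
    S_w = np.zeros((n_features, n_features))
    S_b = np.zeros((n_features, n_features))
    for label in np.unique(y):
        X_c = X[y == label]                 # individual vectors of one class
        mean_c = X_c.mean(axis=0)           # mean vector of that class
        centered = X_c - mean_c
        S_w += centered.T @ centered / n_total
        diff = (mean_c - overall_mean)[:, None]
        S_b += (X_c.shape[0] / n_total) * (diff @ diff.T)
    return S_w, S_b

def fisher_fitness(X, y, mask):
    """Fisher-style score of the feature subset selected by a 0/1 mask.

    Assumed form: trace(S_b) / trace(S_w); larger means better separation.
    """
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        return 0.0                          # an empty subset carries no information
    S_w, S_b = scatter_matrices(np.asarray(X, dtype=float)[:, mask], y)
    return float(np.trace(S_b) / (np.trace(S_w) + 1e-12))

# Toy data: feature 0 separates the two classes, feature 1 does not.
X = [[0, 0], [0, 1], [10, 0], [10, 1]]
y = [0, 0, 1, 1]
score_feature_0 = fisher_fitness(X, y, [1, 0])
score_feature_1 = fisher_fitness(X, y, [0, 1])
```

In this sketch a chromosome is the 0/1 `mask`, so the same function could serve as the fitness calculation rule of one evaluation mode in a genetic search.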

Description

Modeling feature screening method, device, medium and product

Technical Field

The application relates to the technical field of artificial intelligence, and in particular to a modeling feature screening method, device, medium and product.

Background

Model building in the data science industry often involves screening features of many categories and forms. In a credit risk control scenario, developers must acquire highly complex information from multiple channels, and the choice of input features largely determines the final performance of the model. Poorly screened features do not provide sufficient information for model prediction, while exploring each feature in detail takes a significant amount of time and is likely to extend the model development cycle.

The mainstream feature screening approaches in the data science industry are filter methods, wrapper methods and embedded methods. These approaches have the following drawbacks:

(1) Filter methods screen features inefficiently, and each filtering criterion has a different emphasis: for example, correlation can only capture linear relationships, the chi-square test cannot process discrete variables, and synergistic effects among different features are difficult to take into account. In addition, the screening logic itself is susceptible to noise and may not suit the target model, so that variables important for model prediction are removed by mistake.

(2) Wrapper methods are based on greedy algorithms and therefore run inefficiently, especially for models that themselves take a long time to run. In addition, the feature combination may fall into a local optimum: when two features are useful only in synergy, a forward search may discard features that would have shown a synergistic effect once combined.

(3) Embedded methods are limited to specific models: for example, feature importance is available only for tree models, and the regularization trade-off can also bias the result. In addition, these methods increase the model training burden and reduce training efficiency.

Disclosure of Invention

At least one embodiment of the application provides a modeling feature screening method, device, medium and product, which address the limitations of existing feature selection methods in efficiency, generalization and model suitability.

To solve the above technical problem, the application is implemented as follows. In a first aspect, an embodiment of the application provides a modeling feature screening method, including: receiving sample data input by a user; performing evaluation in V different evaluation modes using the training set and the test set, obtaining the corresponding top U chromosomes in each evaluation mode, and determining candidate feature combination sets associated with the evaluation modes and the chromosomes; and ranking the candidate feature combination sets by fitness, and selecting the highest-ranked combination feature as the target feature.
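As a non-authoritative illustration of the first aspect, the following Python sketch runs a toy genetic algorithm once per evaluation mode, keeps the top U chromosomes of each run, merges them with deduplication, and ranks the merged candidates. The truncation selection, single-point crossover, bit-flip mutation, all parameter values, and the names `run_ga` and `merge_and_rank` are assumptions of this sketch. The scoring rule `(U - rank + 1) ** 2` is likewise an assumption: the application states only that the score is based on the square of the chromosome's rank.

```python
import random

def run_ga(fitness, n_features, pop_size=20, rounds=15, top_u=3, seed=0):
    """Toy genetic algorithm for one evaluation mode.

    Returns up to top_u distinct chromosomes (0/1 tuples) with the best
    fitness among all chromosomes evaluated during the whole run.
    """
    rng = random.Random(seed)
    population = [tuple(rng.randint(0, 1) for _ in range(n_features))
                  for _ in range(pop_size)]
    seen = {}                                   # chromosome -> fitness
    for _ in range(rounds):
        scored = [(fitness(c), c) for c in population]
        seen.update({c: f for f, c in scored})
        scored.sort(key=lambda fc: fc[0], reverse=True)
        parents = [c for _, c in scored[:pop_size // 2]]   # selection
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)             # crossover point
            child = list(a[:cut] + b[cut:])
            if rng.random() < 0.2:                         # mutation
                i = rng.randrange(n_features)
                child[i] = 1 - child[i]
            children.append(tuple(child))
        population = children
    return sorted(seen, key=seen.get, reverse=True)[:top_u]

def merge_and_rank(top_lists, top_u):
    """Merge the per-mode top-U lists and pick the best candidate.

    ASSUMPTION: a chromosome at rank r (1 = best) earns (top_u - r + 1) ** 2
    points in each mode where it appears; points are summed across modes.
    """
    totals = {}
    for top in top_lists:
        for rank, chrom in enumerate(top, start=1):
            # Dictionary keys deduplicate chromosomes found by several modes.
            totals[chrom] = totals.get(chrom, 0) + (top_u - rank + 1) ** 2
    best = max(totals, key=totals.get)
    selected = [i for i, bit in enumerate(best) if bit]
    return selected, totals

# Two hypothetical evaluation modes (V = 2) over 5 candidate features.
modes = [
    lambda c: 2 * c[0] + 2 * c[2] - 0.5 * sum(c),
    lambda c: c[0] + c[2] - 0.3 * sum(c),
]
top_lists = [run_ga(f, n_features=5, top_u=3, seed=k) for k, f in enumerate(modes)]
target_features, totals = merge_and_rank(top_lists, top_u=3)
```

Here `target_features` holds the indices of the features switched on in the highest-scoring chromosome; in a real setting each entry of `modes` would be replaced by one of the fitness calculation rules described in the application.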
Optionally, using the sample data, performing evaluation in V different evaluation modes, and determining, based on the top U chromosomes obtained in each evaluation mode, candidate feature combination sets associated with the evaluation modes and the chromosomes includes: determining, using the sample data, an evaluation result corresponding to each evaluation mode, wherein the evaluation result represents a fitness calculation rule based on the corresponding evaluation mode; in the evaluation process of each evaluation mode, determining, according to the evaluation result and a preset genetic algorithm, the chromosomes ranked in the top U by fitness in that evaluation mode; and deduplicating the top U chromosomes corresponding to each evaluation mode to obtain the candidate feature combination set associated with the evaluation modes and the chromosomes.

Optionally, in the case that the evaluation modes include model prediction evaluation, determining an evaluation result corresponding to each evaluation mode using the sample data includes: preprocessing the sample data to determine a training set and a test set; performing modeling and prediction on the training set and the test set according to a preset prediction model, and outputting prediction results for the training set and the test set respectively; selecting at least one classification or regression performance metric, and converting the minimize-type metrics in the prediction results of the traini