CN-114117167-B - Method, device and storage medium for analyzing validity of newly added service data

Abstract

The invention discloses a method, device and storage medium for analyzing the validity of newly added service data. The method comprises: when an evaluation request for newly added service data is detected, determining newly added candidate features from the newly added service data; calculating correlation coefficients between the candidate feature vector corresponding to each newly added candidate feature and the label vectors of the service models to obtain a coefficient vector for the newly added candidate feature; calculating distances between that coefficient vector and the center vector of each cluster; and determining the target feature cluster to which the newly added candidate feature belongs, so that the validity of the newly added candidate feature for the service models is determined according to the stock (existing) features in the target feature cluster. The method has lower computational complexity and can therefore improve the efficiency of analyzing the validity of newly added service data.
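As a rough illustration of the flow summarized above, the sketch below clusters the coefficient vectors of stock features and then assigns a newly added candidate feature to the nearest cluster center. It is a minimal sketch, not the patented implementation: the function names (`cluster_stock_features`, `nearest_cluster`), the use of Euclidean distance, the random initialization, the convergence tolerance and the sample numbers are all assumptions made for the example.

```python
import numpy as np


def cluster_stock_features(coef_vectors, k, max_iter=100, tol=1e-9, seed=0):
    """Plain k-means over stock-feature coefficient vectors (illustrative)."""
    x = np.asarray(coef_vectors, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: initial centers are k distinct stock coefficient vectors.
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(max_iter):
        # Distance from every coefficient vector to every center.
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # New center = mean of the vectors assigned to it (kept if empty).
        new_centers = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:  # assumed end condition: centers stopped moving
            break
    # Final assignment against the final centers.
    dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
    return centers, dists.argmin(axis=1)


def nearest_cluster(coef_vector, centers):
    """Index of, and distance to, the closest cluster center."""
    v = np.asarray(coef_vector, dtype=float)
    dists = np.linalg.norm(np.asarray(centers, dtype=float) - v, axis=1)
    idx = int(dists.argmin())
    return idx, float(dists[idx])


# Hypothetical coefficient vectors: one row per stock feature,
# one column per service model.
stock = [[0.90, 0.80], [0.85, 0.90], [0.10, 0.05], [0.05, 0.10]]
centers, labels = cluster_stock_features(stock, k=2)
target, dist = nearest_cluster([0.88, 0.82], centers)
```

Here the candidate with coefficient vector `[0.88, 0.82]` lands in the same cluster as the first two stock features, so its validity for each service model would be judged from theirs. The cost of the lookup grows with the number of clusters rather than with the number of stock features.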

Inventors

LI HONGYOU
ZHONG HAOMING
LIANG JIAN
ZOU JINGFU
XU AHONG
LV CHENGDONG
ZHANG HAICHUAN
CHEN WEN

Assignees

WeBank Co., Ltd. (深圳前海微众银行股份有限公司)

Dates

Publication Date: 20260512
Application Date: 20211130

Claims (9)

1. A method for analyzing the validity of newly added service data, characterized by comprising the following steps: when an evaluation request for newly added service data is detected, determining newly added candidate features from the newly added service data, acquiring feature values of each first sample under the newly added candidate features from the newly added service data to form candidate feature vectors corresponding to the newly added candidate features, and acquiring label values of each first sample under a target label of a service model to form first label vectors corresponding to the service model; respectively calculating first correlation coefficients between the candidate feature vectors corresponding to the newly added candidate features and the first label vectors corresponding to a plurality of service models, and forming the plurality of first correlation coefficients into coefficient vectors corresponding to the newly added candidate features; calculating distances between the coefficient vectors corresponding to the newly added candidate features and the cluster center vectors, and determining, according to the distance calculation results, the target feature clusters to which the newly added candidate features belong, so as to determine the validity of the newly added service data for each service model according to the validity of the stock features in the target feature clusters for each service model; wherein each cluster center vector is obtained by clustering the coefficient vectors corresponding to the stock features in stock service data, and before the step of determining the newly added candidate features from the newly added service data when the evaluation request for the newly added service data is detected, the method further comprises: acquiring stock service data and determining stock features from the stock service data; acquiring feature values of each second sample under the stock features from the stock service data to form stock feature vectors corresponding to the stock features, and acquiring label values of each second sample under the target labels of the service models to form second label vectors corresponding to the service models; respectively calculating second correlation coefficients between the stock feature vectors corresponding to the stock features and the second label vectors corresponding to the plurality of service models, and forming the plurality of second correlation coefficients into coefficient vectors corresponding to the stock features; and clustering the coefficient vectors corresponding to the stock features, dividing the stock features into a plurality of feature clusters, and obtaining the cluster center vector corresponding to each feature cluster.
2. The method for analyzing the validity of newly added service data according to claim 1, wherein the step of calculating a second correlation coefficient between the stock feature vector corresponding to the stock feature and a second label vector corresponding to the service model comprises: calculating the distance between every two elements in the stock feature vector to obtain a first distance matrix, and calculating the distance between every two elements in the second label vector to obtain a second distance matrix; subtracting from each element of the first distance matrix the mean of its row and the mean of its column and then adding the mean of all elements of the first distance matrix to obtain a third distance matrix, and subtracting from each element of the second distance matrix the mean of its row and the mean of its column and then adding the mean of all elements of the second distance matrix to obtain a fourth distance matrix; dividing the sum of squares of the elements of the third distance matrix by the number of columns and taking the square root to obtain a first value, and dividing the sum of squares of the elements of the fourth distance matrix by the number of columns and taking the square root to obtain a second value; and taking the square root of the product of the first value and the second value to obtain a fourth value, and dividing the third value by the fourth value to obtain the second correlation coefficient between the stock feature vector and the second label vector.
3. The method for analyzing the validity of newly added service data according to claim 1, wherein the step of clustering the coefficient vectors corresponding to the stock features, dividing the stock features into a plurality of feature clusters, and obtaining the cluster center vector corresponding to each feature cluster comprises: initializing a preset number of cluster center vectors; respectively calculating the distance between the coefficient vector corresponding to each stock feature and each cluster center vector, and assigning each stock feature to the feature cluster corresponding to its nearest cluster center vector; averaging the coefficient vectors corresponding to the stock features in each feature cluster to obtain a new cluster center vector, and detecting whether a preset clustering end condition is met; if the clustering end condition is met, ending the clustering; and if the clustering end condition is not met, returning, based on the new cluster center vectors, to the step of calculating the distance between the coefficient vector corresponding to each stock feature and each cluster center vector.
4. The method for analyzing the validity of newly added service data according to claim 1, wherein the step of acquiring feature values of each second sample under the stock features from the stock service data to form stock feature vectors corresponding to the stock features comprises: acquiring feature values of each second sample under the stock features from the stock service data; and replacing empty feature values among the acquired feature values with a preset value to form the stock feature vector corresponding to the stock feature.
5. The method for analyzing the validity of newly added service data according to claim 1, wherein the step of determining the target feature cluster to which the newly added candidate feature belongs according to the distance calculation result comprises: when the distance between the coefficient vector of the newly added candidate feature and one of the cluster center vectors is smaller than a preset value, taking the feature cluster corresponding to that cluster center vector as the target feature cluster to which the newly added candidate feature belongs; when the distances between the coefficient vector of the newly added candidate feature and a plurality of the cluster center vectors are smaller than the preset value, taking the feature cluster corresponding to the cluster center vector at the smallest distance from the coefficient vector of the newly added candidate feature as the target feature cluster to which the newly added candidate feature belongs; and when the distance between the coefficient vector of the newly added candidate feature and every cluster center vector is not smaller than the preset value, adding a new feature cluster as the target feature cluster to which the newly added candidate feature belongs.
6. The method for analyzing the validity of newly added service data according to claim 1, further comprising, after the step of determining the target feature cluster to which the newly added candidate feature belongs according to the distance calculation result: adding the newly added candidate feature to the target feature cluster to increase the number of features in the target feature cluster; when a feature selection request for a target service model is detected, obtaining effect information of each stock feature on each service model, and determining, from the stock features and according to the effect information, a target stock feature that has a positive effect on the training of the target service model; and acquiring the newly added candidate features in the feature cluster where the target stock feature is located and outputting them as new features of the target service model.
7. The method for analyzing the validity of newly added service data according to any one of claims 1 to 6, wherein the step of, when an evaluation request for newly added service data is detected, determining newly added candidate features from the newly added service data and acquiring feature values of each first sample under the newly added candidate features from the newly added service data to form candidate feature vectors corresponding to the newly added candidate features comprises: when an evaluation request for the newly added service data is detected, determining a plurality of original features from the newly added service data; transforming the original features according to a preset cross transformation method to obtain the newly added candidate features; acquiring feature values of each first sample under the original features from the newly added service data, and calculating, from those feature values and according to the transformation formula corresponding to the preset cross transformation method, the feature values of each first sample under the newly added candidate features; and forming the feature values of each first sample under a newly added candidate feature into the candidate feature vector corresponding to that newly added candidate feature.
8. A device for analyzing the validity of newly added service data, characterized in that the device comprises a memory, a processor, and a program for analyzing the validity of newly added service data that is stored on the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the method for analyzing the validity of newly added service data according to any one of claims 1 to 7.
9. A computer-readable storage medium, wherein a program for analyzing the validity of newly added service data is stored on the computer-readable storage medium, and the program, when executed by a processor, implements the steps of the method for analyzing the validity of newly added service data according to any one of claims 1 to 7.
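
The correlation described in claim 2 reads like the sample distance correlation, and the assignment rule of claim 5 is a thresholded nearest-center search. The sketch below illustrates both under stated assumptions. The claim as translated never defines its "third value", so the code assumes it is the distance covariance formed from the two centered matrices, normalized the same way as the first and second values. The function names, the Euclidean distance between coefficient vectors, the sample numbers, and the choice to seed a new cluster with the candidate's own coefficient vector are likewise illustrative and not taken from the patent.

```python
import numpy as np


def _double_center(d):
    """Subtract row and column means, then add back the overall mean."""
    return d - d.mean(axis=1, keepdims=True) - d.mean(axis=0, keepdims=True) + d.mean()


def distance_correlation(x, y):
    """Correlation between a feature vector and a label vector (claim 2 style)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or x.shape != y.shape or x.size == 0:
        raise ValueError("x and y must be non-empty 1-D vectors of equal length")
    n = x.size
    a = _double_center(np.abs(x[:, None] - x[None, :]))  # third distance matrix
    b = _double_center(np.abs(y[:, None] - y[None, :]))  # fourth distance matrix
    first = np.sqrt((a * a).sum() / n)
    second = np.sqrt((b * b).sum() / n)
    # ASSUMPTION: the undefined "third value" is the distance covariance.
    third = np.sqrt(max((a * b).sum(), 0.0) / n)
    fourth = np.sqrt(first * second)
    return 0.0 if fourth == 0 else float(third / fourth)


def assign_to_cluster(coef_vector, centers, threshold):
    """Claim 5 style rule: nearest center within the threshold, else a new cluster."""
    v = np.asarray(coef_vector, dtype=float)
    c = np.asarray(centers, dtype=float)
    dists = np.linalg.norm(c - v, axis=1)
    nearest = int(dists.argmin())
    if dists[nearest] < threshold:
        # Covers both "one center within range" and "several within range".
        return nearest, c
    # ASSUMPTION: a new cluster is seeded with the candidate's own vector.
    return len(c), np.vstack([c, v])


# Hypothetical data: a label that is an exact linear function of the feature.
r = distance_correlation([1.0, 2.0, 3.0, 4.0, 5.0], [3.0, 5.0, 7.0, 9.0, 11.0])
flat = distance_correlation([1.0, 2.0, 3.0], [5.0, 5.0, 5.0])

centers = [[0.9, 0.8], [0.1, 0.1]]
idx_near, _ = assign_to_cluster([0.85, 0.82], centers, threshold=0.2)
idx_new, grown = assign_to_cluster([0.5, 0.5], centers, threshold=0.2)
```

Because the first, second and assumed third values share the same divisor, that divisor cancels in the final ratio, so the result does not depend on whether the sums are divided by the number of columns or by its square.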

Description

Method, device and storage medium for analyzing validity of newly added service data

Technical Field

The present invention relates to the field of feature engineering, and in particular to a method, a device and a storage medium for analyzing the validity of newly added service data.

Background

With the development of internet financial technology and big data technology, data can be acquired in more and more ways, and the feature dimension of an enterprise's financial information database can reach the thousands. Applying such service data directly to the training of each service model can cause problems such as overfitting, high model complexity and a large amount of computation. How to quickly analyze the validity of service data for different service models is therefore a problem faced across the financial technology industry.

Because the enterprise database is continuously updated, the validity of newly added service data for the service models needs to be mined continuously, so that the newly added service data can be applied to the training of the service models and put to use. When judging the validity of newly added service data, that is, whether it has a positive effect on the training of a service model, existing analysis methods require the correlation between the features in the newly added service data and the existing features to be calculated pair by pair, after which the validity of the newly added service data is estimated from the validity of the existing features.

Disclosure of Invention

The main purpose of the invention is to provide a method, device and storage medium for analyzing the validity of newly added service data, aiming to solve the technical problem that existing methods for analyzing the validity of newly added service data calculate the correlation between the features in the newly added service data and the existing feature vectors, which has high time complexity.

To achieve the above purpose, the invention provides a method for analyzing the validity of newly added service data, the method comprising the following steps:

when an evaluation request for newly added service data is detected, determining newly added candidate features from the newly added service data, acquiring feature values of each first sample under the newly added candidate features from the newly added service data to form candidate feature vectors corresponding to the newly added candidate features, and acquiring label values of each first sample under a target label of a service model to form first label vectors corresponding to the service model;

respectively calculating first correlation coefficients between the candidate feature vectors corresponding to the newly added candidate features and the first label vectors corresponding to a plurality of service models, and forming the plurality of first correlation coefficients into coefficient vectors corresponding to the newly added candidate features;

calculating distances between the coefficient vectors corresponding to the newly added candidate features and the cluster center vectors, and determining, according to the distance calculation results, the target feature clusters to which the newly added candidate features belong, so as to determine the validity of the newly added service data for each service model according to the validity of the stock features in the target feature clusters for each service model, wherein each cluster center vector is obtained by clustering the coefficient vectors corresponding to the stock features in stock service data.

Optionally, before the step of determining the newly added candidate features from the newly added service data when the evaluation request for the newly added service data is detected, the method further comprises:

acquiring stock service data and determining stock features from the stock service data;

acquiring feature values of each second sample under the stock features from the stock service data to form stock feature vectors corresponding to the stock features, and acquiring label values of each second sample under the target labels of the service models to form second label vectors corresponding to the service models;

respectively calculating second correlation coefficients between the stock feature vectors corresponding to the stock features and the second label vectors corresponding to the plurality of service models, and forming the plurality of second correlation coefficients into coefficient vectors corresponding to the stock features;

clustering the coefficient vectors corresponding to the stock features, dividing the stock features into a plurality of feature clusters, and obtaining the cluster center vector corresponding to each feature cluster.

Optionally, the step of calculating a second correlation coefficient between the stock feature vector corresponding to the stock feature and a second