CN-121979863-A - Data scheduling method and device, storage medium and electronic equipment

CN121979863A

Abstract

The embodiments of this specification disclose a data scheduling method, an apparatus, a storage medium, and an electronic device. The method comprises: acquiring metadata attribute features and historical access statistics features of candidate data table objects in a first storage system; inputting the metadata attribute features and the historical access statistics features into a pre-trained internal-table demand prediction model, and outputting acceleration gain prediction values for the candidate data tables; screening target objects to be accelerated from the candidate data objects based on the acceleration gain prediction values; and executing an automatic data migration task, synchronizing the data entities of the target objects to be accelerated from the first storage system to a second storage system, and establishing internal storage tables corresponding to the target objects to be accelerated in the second storage system. The embodiments of this specification can improve data use efficiency.

Inventors

  • CHEN QIN
  • LIU JIANHONG
  • HUANG YU

Assignees

  • 重庆蚂蚁消费金融有限公司 (Chongqing Ant Consumer Finance Co., Ltd.)

Dates

Publication Date
2026-05-05
Application Date
2026-03-30

Claims (11)

  1. A data scheduling method, applied to a data processing platform comprising a first storage system and a second storage system, the second storage system having a lower query response delay than the first storage system, the method comprising: acquiring metadata attribute features and historical access statistics features of candidate data table objects in the first storage system; extracting a physical storage capacity feature of each candidate data object based on the metadata attribute features, counting data access frequency features of the candidate data object over a plurality of time windows based on the historical access statistics features, generating a basic dimension feature vector by combining the physical storage capacity feature and the data access frequency features, inputting the basic dimension feature vector into a pre-trained internal-table demand prediction model, and outputting an acceleration gain prediction value for the candidate data object; screening a target object to be accelerated from the candidate data objects based on the acceleration gain prediction value; and executing an automatic data migration task, synchronizing the data entity of the target object to be accelerated from the first storage system to the second storage system, and establishing an internal storage table corresponding to the target object to be accelerated in the second storage system.
  2. The method of claim 1, wherein the plurality of time windows comprises a short-term window, a medium-term window, and a long-term window, and wherein generating the basic dimension feature vector by combining the physical storage capacity feature and the data access frequency features comprises: normalizing the physical storage capacity feature and the data access frequency features to construct a time-series heat feature group comprising a short-term burst degree, a medium-term stability, and a long-term decay degree, and determining an access density feature per unit storage cost based on the physical storage capacity feature; determining a historical query log corresponding to the historical access statistics features, determining computation operator types and data scan volumes of structured query statements in the historical query log, determining a query complexity score based on the computation operator types and the data scan volumes, and taking the query complexity score as a computation cost dimension feature; and performing feature engineering on the time-series heat feature group, the access density feature, the computation cost dimension feature, the physical storage capacity feature, and the data access frequency features to obtain the basic dimension feature vector.
  3. The method of claim 2, wherein inputting the basic dimension feature vector into the pre-trained internal-table demand prediction model and outputting the acceleration gain prediction value for the candidate data object comprises: inputting the basic dimension feature vector into the pre-trained internal-table demand prediction model; inferring, through the internal-table demand prediction model, the accumulated query computation time that would be saved after the candidate data object is migrated to the second storage system, based on the computation cost dimension feature and the data access frequency features in the basic dimension feature vector, determining a future trend correction coefficient for the accumulated saved computation time based on the time-series heat feature group, and generating a resource occupation penalty value representing migration cost based on the physical storage capacity feature and the access density feature; and correcting, through the internal-table demand prediction model, the accumulated saved computation time with the future trend correction coefficient to obtain a forward resource benefit, obtaining the acceleration gain prediction value based on the forward resource benefit and the resource occupation penalty value, and outputting the acceleration gain prediction value for the candidate data object.
  4. The method of claim 1, further comprising: periodically synchronizing reference metadata attribute features and reference historical access statistics features of an existing reference internal storage table from a system table of the second storage system; inputting the reference metadata attribute features and the reference historical access statistics features into the internal-table demand prediction model to continuously evaluate the reference internal storage table and calculate a persistence necessity probability value; and if the persistence necessity probability value indicates that acceleration is no longer needed, executing an automatic cleanup task to delete the corresponding reference internal storage table from the second storage system.
  5. The method of any one of claims 1-4, wherein the internal-table demand prediction model is trained by: selecting data tables currently configured as internal storage tables in the second storage system as positive samples, and randomly sampling a preset proportion of the data tables not configured as internal storage tables in the first storage system as negative samples; dividing the positive and negative samples into a training set and a validation set, performing at least one round of training on an initial internal-table demand prediction model with the training set to establish a linear mapping between input data and internal-table demand labels, mapping the output of the linear mapping to a probability value with a model activation function, and defining the probability value as the acceleration gain prediction value; and validating the initial internal-table demand prediction model with the validation set to obtain the trained internal-table demand prediction model.
  6. The method of claim 1, wherein the first storage system comprises a partitioned storage structure, and wherein executing the automatic data migration task to synchronize the data entity of the target object to be accelerated from the first storage system to the second storage system comprises: identifying a recommended data partition with the highest acceleration gain prediction value within the target object to be accelerated; and synchronizing the data entity of the recommended data partition to the second storage system while retaining the other partition data of the target object to be accelerated in the first storage system.
  7. The method of claim 6, further comprising: creating a mapping table in the second storage system, the mapping table pointing to the full data set in the first storage system; and configuring a query routing policy such that response paths of query requests falling outside the recommended data partition are routed through the mapping table to the first storage system.
  8. A data scheduling apparatus, applied to a data processing platform comprising a first storage system and a second storage system, the second storage system having a lower query response delay than the first storage system, the apparatus comprising: an acquisition module configured to acquire metadata attribute features and historical access statistics features of candidate data table objects in the first storage system; a prediction module configured to extract a physical storage capacity feature of each candidate data object based on the metadata attribute features, count data access frequency features of the candidate data object over a plurality of time windows based on the historical access statistics features, generate a basic dimension feature vector by combining the physical storage capacity feature and the data access frequency features, input the basic dimension feature vector into a pre-trained internal-table demand prediction model, and output an acceleration gain prediction value for the candidate data object; and a scheduling module configured to screen a target object to be accelerated from the candidate data objects based on the acceleration gain prediction value, execute an automatic data migration task, synchronize the data entity of the target object to be accelerated from the first storage system to the second storage system, and establish an internal storage table corresponding to the target object to be accelerated in the second storage system.
  9. A computer storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the method steps of any one of claims 1-7.
  10. A computer program product storing at least one instruction adapted to be loaded by a processor to perform the method steps of any one of claims 1-7.
  11. An electronic device comprising a processor and a memory, wherein the memory stores a computer program adapted to be loaded by the processor to perform the method steps of any one of claims 1-7.
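The scoring logic of claims 2 and 3 (time-series heat features, access density per unit storage, a trend-corrected benefit minus a resource occupation penalty) can be illustrated with a rough sketch. Everything below, including the field names and the exact shape of each formula, is an assumption about one plausible reading of the claims, not the patent's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class TableFeatures:
    """Illustrative feature set mirroring claims 1-3; field names are invented."""
    size_gb: float           # physical storage capacity feature
    hits_short: float        # access counts over short / medium / long windows
    hits_medium: float
    hits_long: float
    query_complexity: float  # score derived from operator types and scan volume

def acceleration_gain(f: TableFeatures) -> float:
    # Accumulated query computation time saved if the table is migrated (claim 3)
    time_saved = f.query_complexity * (f.hits_short + f.hits_medium + f.hits_long)
    # Future-trend correction from the time-series heat group: a recent burst
    # raises the coefficient, long-term decay leaves it near 1
    trend = 1.0 + f.hits_short / (f.hits_long + 1.0)
    # Access density per unit storage cost; large, rarely-read tables are penalized
    density = (f.hits_short + f.hits_medium + f.hits_long) / (f.size_gb + 1e-9)
    penalty = f.size_gb / (density + 1.0)
    # Forward resource benefit minus the resource occupation penalty
    return time_saved * trend - penalty
```

Under this sketch a small, frequently queried table scores a large positive gain, while a large table with near-zero access frequency scores negative and would not be selected for migration.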

Description

Data scheduling method and device, storage medium and electronic equipment

Technical Field

The present disclosure relates to the field of computer technologies, and in particular to a data scheduling method, a data scheduling apparatus, a storage medium, and an electronic device.

Background

In the field of big data processing and analysis, with the rapid growth of internet service platforms, servers generate massive volumes of user interaction logs and system operation data every day. To guarantee persistent storage and full-scale analysis capability, the industry typically employs offline computing platforms based on a distributed architecture as the core data warehouse. However, such offline platforms are designed mainly for large-scale batch computing tasks and suffer I/O bottlenecks under ad-hoc queries and high-concurrency data retrieval; query response delay often reaches minutes or even hours, making it difficult to meet the data timeliness requirements of online services. To improve data query efficiency, the related art generally introduces a high-performance real-time analysis engine based on in-memory computation or columnar storage as an acceleration layer over the offline warehouse. Currently, data scheduling in such heterogeneous storage architectures relies mainly on operations personnel configuring migrations manually from service experience, or on mechanical migration by simple static rules (such as synchronizing all of the current day's data). This management mode cannot dynamically perceive changes in data access heat: newly generated hotspot data is not accelerated in time, while large amounts of rarely accessed historical cold data occupy the expensive storage and computation resources of the real-time engine for long periods, causing serious resource waste and high system maintenance cost.
Disclosure of Invention

The embodiments of this specification provide a data scheduling method, an apparatus, a storage medium, and an electronic device, which can improve data use efficiency. The technical scheme is as follows.

In a first aspect, an embodiment of the present disclosure provides a data scheduling method applied to a data processing platform comprising a first storage system and a second storage system, where the query response delay of the second storage system is lower than that of the first storage system, the method comprising: acquiring metadata attribute features and historical access statistics features of candidate data table objects in the first storage system; inputting the metadata attribute features and the historical access statistics features into a pre-trained internal-table demand prediction model, and outputting an acceleration gain prediction value for each candidate data table; screening a target object to be accelerated from the candidate data objects based on the acceleration gain prediction value; and executing an automatic data migration task, synchronizing the data entity of the target object to be accelerated from the first storage system to the second storage system, and establishing an internal storage table corresponding to the target object to be accelerated in the second storage system.
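As a rough illustration only, the first-aspect flow of predicting gains, screening targets, and migrating them could be sketched as follows; the function names, the callback-based structure, and the threshold parameter are all invented for illustration and do not appear in the patent:

```python
def schedule_acceleration(candidates, predict_gain, migrate, threshold=0.0):
    """Screen candidate tables by predicted acceleration gain and migrate the
    selected targets from the first (offline) storage system to the second
    (low-latency) storage system. `predict_gain` stands in for the internal-table
    demand prediction model; `migrate` stands in for the automatic migration
    task that syncs the data entity and creates the internal storage table."""
    targets = [t for t in candidates if predict_gain(t) > threshold]
    for table in targets:
        migrate(table)
    return targets
```

For example, with predicted gains `{"a": 3.0, "b": -1.0, "c": 0.5}`, only tables `a` and `c` pass the screen and are handed to the migration task.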
In a second aspect, embodiments of the present disclosure provide a data scheduling apparatus applied to a data processing platform comprising a first storage system and a second storage system, where the query response delay of the second storage system is lower than that of the first storage system, the apparatus comprising: an acquisition module configured to acquire metadata attribute features and historical access statistics features of candidate data table objects in the first storage system; a prediction module configured to input the metadata attribute features and the historical access statistics features into a pre-trained internal-table demand prediction model and output acceleration gain prediction values for the candidate data tables; and a scheduling module configured to screen a target object to be accelerated from the candidate data objects based on the acceleration gain prediction values, execute an automatic data migration task, synchronize the data entity of the target object to be accelerated from the first storage system to the second storage system, and establish an internal storage table corresponding to the target object to be accelerated in the second storage system.

In a third aspect, the present description provides a computer storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the above-described method steps.

In a fourth aspect, the present description provides a computer program product storing at least one instruction adapted to be loaded by a processor to perform the above-described method steps.
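The training procedure of claim 5, a linear mapping from features to an internal-table demand label with an activation function that maps the output to a probability, matches plain logistic regression. The following is a minimal stdlib-only sketch under that assumption; all names and hyperparameters are illustrative, not from the patent:

```python
import math

def sigmoid(z: float) -> float:
    # Model activation function: maps the linear output to a probability
    return 1.0 / (1.0 + math.exp(-z))

def train_demand_model(samples, labels, lr=0.5, epochs=300):
    """Fit w, b of a logistic model p = sigmoid(w . x + b) by gradient descent.
    Per claim 5: positive samples are tables already configured as internal
    storage tables; negative samples are a random draw of tables that are not."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log-loss w.r.t. the linear output
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    # The probability output is defined in claim 5 as the acceleration gain
    # prediction value
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

Trained on such positive/negative samples, the model scores an unseen table's feature vector with a probability that the scheduling module can then compare against a threshold.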