CN-122019499-A - Batch database operation execution sequence optimization method based on time delay detection

CN122019499ACN 122019499 ACN122019499 ACN 122019499ACN-122019499-A

Abstract

The invention provides a batch database operation execution sequence optimization method based on time delay detection, which improves the execution efficiency of operations such as large-scale deletion, updating and the like by converting random I/O into quasi-sequence I/O. The method combines GMM statistical modeling and LLM cognitive analysis to form a double verification framework, not only can adaptively identify storage performance levels, but also can convert abstract statistical peaks into insights with clear business semantics, and the analysis depth and accuracy of the method are far superior to those of the traditional method. By adopting an intelligent sampling and sequence association full mapping strategy, the physical aggregation relation of the data to be operated can be deduced by detecting data with a very small proportion. The mapping process can be efficiently completed in the database, and the minimized disturbance and global optimization coverage of the production system are realized. And by combining a cognitive throttling mechanism based on real-time load, the high throughput, low delay and predictability of the optimization process are ensured, and the stability of the core service is ensured.

Inventors

YE ZHENGSHENG
WU MIN
ZHOU ZHENXING
YANG YANGYANG

Assignees

玖章算术(浙江)科技有限公司

Dates

Publication Date: 20260512
Application Date: 20260112

Claims (10)

1. A batch database operation execution sequence optimization method based on time delay detection is applied to a black box database environment, and is characterized by comprising the following steps: a) Acquiring a data record identification list to be operated according to service screening conditions, selecting a sequencing field capable of reflecting a data writing time sequence or a logic sequence, and distributing a unique sequence index (seq_index) for each data record identification in the list according to the sequence of the sequencing field; b) Sampling a portion of the record identifications from the list of data record identifications as probe samples; c) Performing baseline calibration on the probe sample, deducting network baseline time delay, performing lightweight query operation on each record identifier in the calibrated probe sample, and measuring access time delay of the record identifier to form a time delay data set containing the record identifier, the sequence index and the time delay; d) Analyzing the time delay data set to identify at least one physical thermal cluster corresponding to a cache hit or a lower access time delay, wherein the physical thermal cluster is composed of a set of probe record identifications logically continuous on the sequence index and having access time delays below a preset performance level; e) Classifying all other record identifications which are not sampled in the data record identification list, namely calculating the weighted distance between the sequence index of each record identification and the sequence index of the probe record identification in the physical hot cluster, and distributing corresponding cluster types for each record identification by adopting a multi-iteration adjacent propagation strategy to form a plurality of optimization batches with physical adjacency; f) And submitting and executing corresponding batch database operations to the database according to the priority order of the optimized batch.
2. The method of claim 1, wherein the analysis of the time delay dataset in step d) comprises a dual verification cognitive modeling process comprising: i, using a Gaussian Mixture Model (GMM) to carry out statistical clustering on the time delay data set so as to identify peaks of a plurality of performance levels and demarcation thresholds thereof; ii analyzing statistical features of the time-lapse dataset using a Large Language Model (LLM) as a cognitive interpretation engine to independently identify performance levels and provide an interpretation of physical significance; And iii, starting an intelligent arbitration flow, and when the analysis results of the GMM and the LLM are inconsistent, determining whether to adopt the statistical result of the GMM or the correction opinion of the LLM according to the peak separation index of the GMM analysis results so as to finally determine the division of the performance level.
3. The method according to claim 1, wherein the algorithm for classifying in step e) is a neighbor classification algorithm or a distance-based classification algorithm, and the classifying step comprises the following steps: i, calculating a weighted distance between a sequence index of each record mark and a sequence index of a probe record mark, wherein the weighted distance has a calculation formula of weighted distance= |target record, seq_index-probe, seq_index|/weight factor, wherein if the probe is a preamble probe with the sequence index smaller than or equal to the target record sequence index, the weight factor takes a first threshold value, and if the probe is a subsequent probe, the weight factor takes a second threshold value, and a higher weight is given to the preamble probe to utilize a database page pre-reading mechanism; ii employing a multi-round iterative strategy using progressively relaxed distance thresholds for proximity classification, including direct proximity classification, proximity propagation diffusion, and final classification attempts with relaxed thresholds; iii the distance threshold is adaptively determined based on the probe distribution density to adapt to the actual distribution characteristics of the different data sets.
4. The method of claim 1, wherein prior to classifying in step e), the method further comprises a sequence association self-verification step of checking whether probe record identifications in the identified physical hot clusters exhibit high continuity in the sequence index, and if the continuity does not satisfy a predetermined condition, terminating the time delay-based optimization process and reverting to a conventional batch mode executed in logical order of the ordered fields.
5. The method of claim 1, wherein the categorizing of step e) is accomplished by executing an SQL query within a database, the process comprising: i, storing the representative sequence index of the identified physical thermal cluster as a representative probe into a temporary table; ii each record is identified by performing a distance calculation and ordering based SQL update operation within the database to find its nearest neighbor representation probe in sequence space.
6. The method of claim 1, further comprising, prior to steps b) and c), a security pre-check step of using EXPLAIN command analysis to obtain an execution plan for the operation of the data record identification list, and terminating the delay-based optimization procedure if the plan is to trigger a full table scan.
7. The method according to claim 1, wherein in the step f), the step of submitting and executing the respective batch database operations to the database in the order of priority of the optimized batches is performed in the order of priority from the batch corresponding to the physical hot cluster to the batch corresponding to the physical cold spot, and differential throttling control strategies are adopted for the batches with different priorities.
8. The method of claim 1, further comprising the step of dynamic re-probing: i monitoring the variation of the execution performance of the batch in the process of executing the optimized batch; And ii) re-executing steps b) to e) on the remaining unprocessed data when the performance degradation is detected to reach a preset condition or to meet a preset re-detection triggering condition, wherein the re-detection triggering condition comprises at least one of the following conditions that the batch execution performance is degraded by more than a preset threshold value, the batch is completed by a preset number of batches, a preset time interval is elapsed or a preset change situation of the database cache hit rate is detected; and iii, regenerating the optimized batch based on the new detection and analysis results and continuing to execute, wherein the dynamic re-detection adopts an incremental strategy to sample and detect only data which is not processed.
9. A database operations optimization system, the electronic device comprising: one or more processors; a memory having stored thereon computer executable instructions; The processor is configured to execute the instructions to implement the method of any one of claims 1 to 11.
10. A computer readable storage medium having stored therein program code which is callable by a processor to perform the method of any one of claims 1 to 11.

Description

Batch database operation execution sequence optimization method based on time delay detection Technical Field The invention relates to the technical field of database performance optimization, in particular to a batch database operation execution sequence optimization method based on time delay detection, a database operation optimization system, electronic equipment and a non-transitory computer readable storage medium. Background Processing large amounts of historical data is a common and troublesome operation and maintenance task in long-term operation of databases. Typical application scenarios include: the historical order of the E-commerce platform is cleaned, namely the completed order is deleted 3 months ago; The periodic archiving of the log system, namely updating the log state 7 days before to be archived; deleting all data of a specific user according to compliance requirements; ETL process of data warehouse, batch updating the status field of dimension table. The common characteristics of these scenarios are large amount of operation data (typically tens of millions or more), operation object determination by business logic (discontinuous primary key), and high requirements for execution efficiency and business stability. When such operations involve tens or even billions of data, developers often find that even if the operation instruction itself is simple (e.g., DELETE FROM WHERE ID IN (..) the execution is extremely slow, and may even cause the database response to timeout, severely affecting on-line traffic. The root of this is physical storage fragmentation. The logical order of the data (e.g., the self-increasing primary key) and its physical storage location on disk become irrelevant after frequent additions and deletions. Therefore, a batch deletion task performed in the primary key ID order may degrade to a large number of random I/Os at the physical level, resulting in frequent seek of the disk head and a dramatic drop in performance. This is also true in distributed systems, where IOs may be hashed onto a large number of backend storage servers. Conventional performance optimization tools typically rely on accessing physical metadata (e.g., ROWID) within the database to solve this problem. However, in modern database operation and maintenance practice, especially in a cloud database (DBaaS) environment, the present invention often needs to solve the problem only through a standard SQL interface without modifying the database kernel and without relying on physical layout information provided by a specific database vendor. Therefore, how to solve the bottleneck of batch operation performance caused by physical fragmentation in the general scene becomes a technical problem to be solved urgently. Disclosure of Invention In order to solve the technical problems in the prior art, the invention provides a performance optimization method for large-scale batch data operation (such as deletion and update), which deduces the physical layout of data by analyzing access time delay and rebuilds the operation sequence to improve the I/O efficiency under the condition of not depending on metadata (black box environment) in a database. The following is the detailed content: In one aspect, a method for optimizing execution sequence of batch database operations based on latency detection is provided, and the method is applied to a black box database environment, and comprises the following steps: a) Acquiring a data record identification list to be operated according to service screening conditions, selecting a sequencing field capable of reflecting a data writing time sequence or a logic sequence, and distributing a unique sequence index (seq_index) for each data record identification in the list according to the sequence of the sequencing field; b) Sampling a portion of the record identifications from the list of data record identifications as probe samples; c) Executing a lightweight query operation on each record identifier in the probe sample, and accurately measuring access time delay of the lightweight query operation to form a time delay data set containing record identifiers, sequence indexes and time delay; d) Analyzing the latency dataset to identify at least one physical thermal cluster corresponding to a cache hit, wherein the physical thermal cluster is comprised of a set of probe record identifications logically consecutive on the sequence index with access latency at a lowest performance level; e) Classifying all other record identifications which are not sampled in the data record identification list in one-dimensional KNN, namely calculating the weighted distance between the sequence index of each record identification and the sequence index of the probe record identification in the physical thermal cluster, and distributing corresponding cluster types for each record identification by adopting a multi-round iterative adjacent propagation strategy to form a plurality of optimized batc