CN-122001824-A - Data processing method and device, nonvolatile storage medium and electronic equipment

Abstract

The application discloses a data processing method and device, a nonvolatile storage medium, and electronic equipment. The method comprises: obtaining a list of files to be transmitted; detecting a storage performance index of the storage system that stores the list and the network bandwidth of the transmission environment; determining a target number of fragments for fragmenting the files according to the total data amount of the list, the storage performance index, and the network bandwidth; extracting from the list a plurality of files to be transmitted whose data amount is smaller than a first preset threshold, and merging them into a logical block of a preset data amount within the list to obtain a target list of files to be transmitted; and fragmenting the target list according to the target number of fragments to obtain target fragments. The application addresses the technical problems of low transmission efficiency and long task duration that arise because the related art cannot dynamically adjust the fragment scale according to network bandwidth fluctuations and storage system performance.

Inventors

  • DAI ZHENCONG
  • ZHOU DONGXU
  • SUN JIAN
  • LIU DONG

Assignees

  • 中国电信股份有限公司

Dates

Publication Date
20260508
Application Date
20260226

Claims (11)

  1. A data processing method, comprising: acquiring a list of files to be transmitted, and detecting a storage performance index of a storage system that stores the list of files to be transmitted and a network bandwidth of a transmission environment; determining a target number of fragments for fragmenting the files according to the total data amount of the list of files to be transmitted, the storage performance index, and the network bandwidth; extracting, from the list of files to be transmitted, a plurality of files to be transmitted whose data amount is smaller than a first preset threshold, and merging those files into a logical block of a preset data amount within the list of files to be transmitted to obtain a target list of files to be transmitted; and fragmenting the target list of files to be transmitted according to the target number of fragments to obtain target fragments.
  2. The method of claim 1, wherein determining the target number of fragments for fragmenting the files according to the total data amount of the list of files to be transmitted, the storage performance index, and the network bandwidth comprises: determining a comprehensive load-bearing factor according to the product of the storage performance index and the network bandwidth, wherein the storage performance index comprises at least the number of storage read/write operations per second; determining the proportion by which the network bandwidth of the current detection period has decreased relative to the network bandwidth of the immediately preceding detection period; and evaluating a preset fragmentation decision function on the total data amount, the comprehensive load-bearing factor, and the decrease proportion to obtain the target number of fragments, wherein the target number of fragments is positively correlated with the total data amount and negatively correlated with the comprehensive load-bearing factor.
  3. The method of claim 2, wherein evaluating the preset fragmentation decision function on the total data amount, the comprehensive load-bearing factor, and the decrease proportion to obtain the target number of fragments comprises: evaluating the preset fragmentation decision function on the total data amount, the comprehensive load-bearing factor, and the decrease proportion to obtain an initial number of fragments; comparing the initial number of fragments with a preset upper limit on the number of fragments, and taking the smaller of the two as an intermediate result, wherein the upper limit is set according to the hardware concurrency queue depth of the target storage medium and the available memory capacity of the current execution environment; and comparing the intermediate result with a preset lower limit on the number of fragments, and taking the larger of the two as the target number of fragments, wherein the lower limit is set according to the number of processor cores of the current computing environment and the inherent delay period of scheduling a single task.
  4. The method of claim 1, wherein acquiring the list of files to be transmitted comprises: detecting a file change event and acquiring a target service type corresponding to the file change event; matching, according to the target service type, a corresponding target time window threshold from a preset window configuration library, wherein the window configuration library comprises at least a first time window threshold corresponding to a first service type, a second time window threshold corresponding to a second service type, and a third time window threshold corresponding to a third service type; the first service type corresponds to a first delay sensitivity, the second service type to a second delay sensitivity, and the third service type to a third delay sensitivity, wherein each delay sensitivity is inversely related to the service object's tolerance for delay, the first delay sensitivity is greater than the second, the second is greater than the third, the third time window threshold is greater than the second time window threshold, and the second time window threshold is greater than the first time window threshold; starting a timer based on the target time window threshold, and caching file change events with the same characteristics that are generated during the timing period to produce a set of events to be processed; and, when the timer reaches the target time window threshold, aggregating and deduplicating the set of events to be processed to obtain the list of files to be transmitted.
  5. The method of claim 4, wherein starting a timer based on the target time window threshold comprises: counting the event accumulation density of the set of events to be processed during the caching period; when the event accumulation density exceeds a preset congestion threshold, dynamically scaling the target time window threshold by a preset proportion to obtain a temporary window threshold; and, in response to the temporary window threshold, triggering the aggregation and deduplication step to release cache resources.
  6. The method of claim 1, wherein after obtaining the target fragments, the method further comprises: acquiring a plurality of fragment tasks for transmitting the target fragments, and sending the plurality of fragment tasks to a main execution node and a standby execution node in a preset execution cluster; allocating, in the main execution node, a plurality of corresponding target transmission threads to the plurality of fragment tasks based on a dual-level concurrency control model, wherein the dual-level concurrency control model performs load-balanced scheduling of the plurality of fragment tasks within the main execution node and activates, in a thread pool, a plurality of target transmission threads matching the plurality of fragment tasks according to a preset computing-power quota of the main execution node; determining, based on the target transmission threads, a memory mapping in kernel space for the fragment corresponding to each of the plurality of fragment tasks, mapping the fragments from disk to the network protocol stack through the memory mapping, and performing data sending operations through the plurality of target transmission threads; detecting the execution state of the data sending operations, determining the current transmission progress according to the amount of data successfully sent, and storing the current transmission progress in the storage system; and, when the main execution node fails, acquiring, by the standby execution node, the current transmission progress from the storage system and continuing to execute the plurality of fragment tasks from the position following the current transmission progress.
  7. The method of claim 6, wherein performing data sending operations through the plurality of target transmission threads comprises: acquiring a plurality of operating indicators in the data transmission link and dynamically adjusting the data sending frequency according to the operating indicators, wherein the operating indicators comprise at least a transport-layer indicator characterizing network transmission quality, a storage-layer indicator characterizing the response performance of the storage device, and a service-layer indicator characterizing data integrity and compliance; identifying sensitive information with fixed-format features through structured feature retrieval based on preset pattern matching rules to obtain a first identification result; calculating, through unstructured semantic analysis based on a deep learning semantic model, the semantic relevance between the target fragments and preset sensitive categories to obtain a second identification result; when the first identification result and/or the second identification result satisfies a preset compliance trigger condition, performing a preset associated protection action on the target sensitive data in the target fragments to obtain processed fragment data; and sending, through the plurality of target transmission threads, the original fragment data that did not trigger the compliance condition or the processed fragment data.
  8. A data processing apparatus, comprising: an acquisition module for acquiring a list of files to be transmitted and detecting a storage performance index of a storage system that stores the list of files to be transmitted and a network bandwidth of a transmission environment; a determining module for determining a target number of fragments for fragmenting the files according to the total data amount of the list of files to be transmitted, the storage performance index, and the network bandwidth; a processing module for extracting, from the list of files to be transmitted, a plurality of files to be transmitted whose data amount is smaller than a first preset threshold, and merging those files into a logical block of a preset data amount within the list of files to be transmitted to obtain a target list of files to be transmitted; and a generating module for fragmenting the target list of files to be transmitted according to the target number of fragments to obtain target fragments.
  9. A non-volatile storage medium, characterized in that the non-volatile storage medium comprises a stored program, wherein the program, when run, controls a device in which the non-volatile storage medium is located to perform the data processing method of any one of claims 1 to 7.
  10. An electronic device comprising a memory and a processor for executing a program stored in the memory, wherein the program is executed to perform the data processing method of any one of claims 1 to 7.
  11. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the data processing method of any one of claims 1 to 7.
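
The bounded decision described in claims 2 and 3 can be illustrated with a short Python sketch. The claims do not disclose the form of the preset fragmentation decision function, so the formula below (total data amount divided by the load-bearing factor, scaled up by the decrease proportion) is purely an assumption chosen to satisfy the stated positive and negative correlations. The names `load_factor`, `drop_ratio`, `target_fragment_count`, and the `scale` parameter are hypothetical, as is the choice to increase the fragment count when bandwidth drops.

```python
import math


def load_factor(iops: float, bandwidth_mbps: float) -> float:
    """Comprehensive load-bearing factor: product of storage IOPS and bandwidth."""
    return iops * bandwidth_mbps


def drop_ratio(prev_bandwidth_mbps: float, bandwidth_mbps: float) -> float:
    """Proportion by which bandwidth fell since the previous detection period.

    Returns 0.0 when bandwidth did not fall (an assumption of this sketch).
    """
    if prev_bandwidth_mbps <= 0:
        return 0.0
    return max(0.0, (prev_bandwidth_mbps - bandwidth_mbps) / prev_bandwidth_mbps)


def target_fragment_count(total_bytes, iops, bandwidth_mbps, prev_bandwidth_mbps,
                          *, lower, upper, scale=1.0):
    """Hypothetical decision function followed by the upper/lower clamp of claim 3."""
    factor = load_factor(iops, bandwidth_mbps)
    if factor <= 0:
        raise ValueError("load-bearing factor must be positive")
    drop = drop_ratio(prev_bandwidth_mbps, bandwidth_mbps)
    # Assumed form: grows with total data, shrinks with the load-bearing factor,
    # and grows when bandwidth has just dropped.
    initial = math.ceil(scale * total_bytes / factor * (1.0 + drop))
    intermediate = min(initial, upper)   # cap at the preset upper limit
    return max(intermediate, lower)      # never go below the preset lower limit


# Bandwidth halves from 10000 to 5000 Mbps; the initial value 150 is capped at 64.
print(target_fragment_count(1_000_000_000, 2_000, 5_000, 10_000, lower=8, upper=64))  # 64
```

In a real system the `upper` and `lower` arguments would be derived from queue depth, available memory, core count, and scheduling delay as claim 3 describes; here they are plain inputs.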

Description

Data processing method and device, nonvolatile storage medium and electronic equipment

Technical Field

The application relates to the technical field of cloud computing storage, and in particular to a data processing method and device, a nonvolatile storage medium, and electronic equipment.

Background

In data-intensive application scenarios such as intelligent computing and large model training, enterprises frequently need to perform full and incremental migration of unstructured data among multiple distributed object storage systems. Tools widely adopted in the industry, such as s5cmd, rclone, and AWS DataSync, provide basic parallel transmission capability, but their core scheduling mechanisms generally rely on a preset fixed fragmentation strategy (for example, a default fragment size of 32 MB) and do not take dynamic changes in the actual transmission environment into account. When facing massive numbers of small files (such as logs, images, and text fragments no larger than 10 KB), these tools still issue a separate S3 API request for each file. The number of HTTP requests therefore increases sharply, which triggers a chain of problems such as network connection pool exhaustion, object storage metadata service overload, and IOPS bottlenecks, so that actual throughput falls far below the physical bandwidth limit.

Further, existing tools cannot sense or adapt in real time to network bandwidth fluctuations or to changes in storage system performance (e.g., IOPS, latency). For example, when the network bandwidth suddenly drops from 10 Gbps to 5 Gbps, a fixed fragmentation strategy keeps the original number of fragments, so each fragment carries too much data and transfers time out frequently. Conversely, in a high-IOPS storage environment, low-concurrency fragmentation is still used, and the parallel processing capability of the storage device is not fully utilized.
In view of the above problems, no effective solution has been proposed at present.

Disclosure of Invention

The application provides a data processing method and device, a nonvolatile storage medium, and electronic equipment, which at least solve the technical problems of low transmission efficiency and excessively long task duration that arise because the related art cannot dynamically adjust the fragment scale according to network bandwidth fluctuations and storage system performance.

According to one aspect of the application, a data processing method is provided. The method comprises: obtaining a list of files to be transmitted; detecting a storage performance index of the storage system that stores the list and the network bandwidth of the transmission environment; determining a target number of fragments for fragmenting the files according to the total data amount of the list, the storage performance index, and the network bandwidth; extracting from the list a plurality of files to be transmitted whose data amount is smaller than a first preset threshold, and merging them into a logical block of a preset data amount within the list to obtain a target list of files to be transmitted; and fragmenting the target list according to the target number of fragments to obtain target fragments.
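
The merging and fragmentation steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the application's implementation: the `Entry` type, the function names, the sequential packing of small files, and the greedy "largest entry to the least-loaded fragment" split are all hypothetical choices, since the text does not say how logical blocks are filled or how entries are assigned to fragments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    name: str
    size: int             # bytes
    members: tuple = ()   # original small files packed into a logical block


def merge_small_files(files, small_threshold, block_size):
    """Pack files smaller than small_threshold into logical blocks of about block_size."""
    large = [f for f in files if f.size >= small_threshold]
    small = [f for f in files if f.size < small_threshold]
    blocks, current, current_size = [], [], 0

    def flush():
        nonlocal current, current_size
        if current:
            blocks.append(Entry(f"block-{len(blocks)}", current_size, tuple(current)))
            current, current_size = [], 0

    for f in small:
        # Close the current block before it would exceed the preset block size.
        if current_size + f.size > block_size:
            flush()
        current.append(f)
        current_size += f.size
    flush()
    return large + blocks   # the "target list of files to be transmitted"


def split_into_fragments(entries, n):
    """Assumed strategy: place each entry, largest first, in the lightest fragment."""
    if n < 1:
        raise ValueError("fragment count must be at least 1")
    fragments = [[] for _ in range(n)]
    loads = [0] * n
    for e in sorted(entries, key=lambda e: e.size, reverse=True):
        i = loads.index(min(loads))
        fragments[i].append(e)
        loads[i] += e.size
    return fragments


files = [Entry("a", 100), Entry("b", 10), Entry("c", 20), Entry("d", 30), Entry("e", 5)]
target_list = merge_small_files(files, small_threshold=50, block_size=40)
for fragment in split_into_fragments(target_list, 2):
    print([e.name for e in fragment])
```

With these inputs the four small files collapse into two logical blocks, so the transfer layer would issue requests for three entries instead of five.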
Determining the target number of fragments comprises: determining a comprehensive load-bearing factor according to the product of the storage performance index and the network bandwidth, wherein the storage performance index comprises at least the number of storage read/write operations per second; determining the proportion by which the network bandwidth of the current detection period has decreased relative to the network bandwidth of the immediately preceding detection period; and evaluating a preset fragmentation decision function on the total data amount, the comprehensive load-bearing factor, and the decrease proportion to obtain the target number of fragments, wherein the target number of fragments is positively correlated with the total data amount and negatively correlated with the comprehensive load-bearing factor. Evaluating the preset fragmentation decision function on the total data amount, the comprehensive load-bearing factor, and the decrease proportion to obtain the target number of fragments comprises: evaluating the function on those inputs to obtain an initial number of fragments; comparing the initial number of fragments with a preset upper limit on the number of fragments, and taking the smaller of the two as an intermediate result, wherein the upper limit on the number of fragments is set