WO-2026092475-A1 - DATA PROCESSING METHOD, SYSTEM AND APPARATUS

Abstract

Provided in embodiments of the present disclosure are a data processing method, system and apparatus. The method comprises: receiving a plurality of data processing requests, wherein the plurality of data processing requests carry virtual file identifiers and data processing content corresponding to the data processing requests; on the basis of the virtual file identifiers, determining virtual files corresponding to the data processing requests and target file sets corresponding to the virtual files, wherein each target file set is a set of physical files in a target data storage system, at least two virtual files among the plurality of virtual files correspond to a same target file set, and one target file set corresponds to one file handle; and using the file handles of the target file sets to open the target file sets, and executing the plurality of data processing requests on the basis of the data processing content, the virtual files, and the target file sets corresponding to the virtual files. When a large number of files are processed, one (or a small number of) files are actually written into a file system, thereby reducing resource occupation and supporting more concurrent processing.
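The handle-sharing idea in the abstract can be illustrated with a minimal Python sketch. Everything named here (`TargetFileSet`, `VirtualFileRegistry`, the modulo assignment policy, the file names) is a hypothetical illustration, not the disclosed implementation; the sketch only shows several virtual files resolving to one target file set that holds a single open handle.

```python
import os
import tempfile


class TargetFileSet:
    """Hypothetical stand-in for a target file set: physical file(s) behind one handle."""

    def __init__(self, directory, name):
        self.path = os.path.join(directory, name)
        self.handle = open(self.path, "ab")  # the single handle for the whole set
        self.size = 0                        # bytes appended so far

    def append(self, payload):
        offset = self.size
        self.handle.write(payload)
        self.size += len(payload)
        return offset

    def close(self):
        self.handle.close()


class VirtualFileRegistry:
    """Maps virtual file identifiers to a shared target file set."""

    def __init__(self, directory, num_sets=1):
        self.sets = [TargetFileSet(directory, f"set-{i}.log") for i in range(num_sets)]
        self.by_id = {}  # virtual file id -> (virtual path, TargetFileSet)

    def create(self, virtual_path):
        vid = len(self.by_id) + 1
        target = self.sets[vid % len(self.sets)]  # assumed assignment policy
        self.by_id[vid] = (virtual_path, target)
        return vid

    def write(self, vid, payload):
        _path, target = self.by_id[vid]
        return target.append(payload)

    def open_handle_count(self):
        return len(self.sets)

    def close(self):
        for target in self.sets:
            target.close()
```

With `num_sets=1`, any number of virtual files created through `create` share one physical file and one handle, so the handle count stays constant as the number of virtual files grows.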

Inventors

  • WANG, XIANG
  • LI, Wentan
  • DAI, MIN
  • ZHAO, Baiqiang
  • SHEN, Chunhui
  • ZHANG, WEI

Assignees

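  • Cloud Intelligence Assets Holding (Singapore) Private Limited (云智能资产控股(新加坡)私人股份有限公司)
  • Hangzhou AliCloud Feitian Information Technology Co., Ltd. (杭州阿里云飞天信息技术有限公司)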

Dates

Publication Date: 2026-05-07
Application Date: 2025-10-28
Priority Date: 2024-11-01

Claims (20)

  1. A data processing method, comprising: receiving multiple data processing requests, wherein each of the multiple data processing requests carries a corresponding virtual file identifier and data processing content; determining, based on the virtual file identifier, the virtual file corresponding to the data processing request and the target file set corresponding to the virtual file, wherein the target file set is a set of physical files in a target data storage system, at least two virtual files among the multiple virtual files correspond to the same target file set, and one target file set corresponds to one file handle; and opening the target file set using the file handle of the target file set, and executing the multiple data processing requests according to the data processing content, the virtual file, and the target file set corresponding to the virtual file.
  2. The data processing method according to claim 1, wherein the multiple data processing requests are multiple data write requests and the data processing content is data to be written; accordingly, executing the multiple data processing requests according to the data processing content, the virtual file, and the target file set corresponding to the virtual file comprises: obtaining the data to be written corresponding to the virtual file, based on the correspondence between the virtual file and the data processing request and the association between the data processing request and the data to be written; encapsulating the data to be written corresponding to the virtual file to obtain a data encapsulation package corresponding to the virtual file; writing the data encapsulation package corresponding to the virtual file into the target buffer of the target file set corresponding to the virtual file; and, when a preset write condition is met, writing the data encapsulation package written into the target buffer to the target file set corresponding to the target buffer.
  3. The data processing method according to claim 1 or 2, wherein, before determining, based on the virtual file identifier, the virtual file corresponding to the data processing request and the target file set corresponding to the virtual file, the method further comprises: receiving a creation request for the virtual file, wherein the creation request carries the file path of the virtual file; creating the virtual file based on the file path and assigning a corresponding target file set to the virtual file; and writing the correspondence between the virtual file and its target file set to a target storage location, and determining a corresponding virtual file identifier for the virtual file.
  4. The data processing method according to claim 2 or 3, wherein writing, when the preset write condition is met, the data encapsulation package written into the target buffer to the target file set corresponding to the target buffer comprises: determining a physical file in the open state in the target file set as the target file, wherein the open state indicates that the physical file has been opened and can accept written data, and the file handle corresponding to the target file set is the file handle of the target file; and, when the preset write condition is met, writing the data of the data encapsulation package in the target buffer to the target file in the target file set corresponding to the target buffer, using the write thread corresponding to the target file set.
  5. The data processing method according to any one of claims 2-4, wherein, after writing, when the preset write condition is met, the data encapsulation package written into the target buffer to the target file set corresponding to the target buffer, the method further comprises: if the data encapsulation package written into the target buffer is successfully written to the target file set corresponding to the target buffer, calling, by a flush thread, a flush interface to store the data encapsulation package successfully written to the target file set in a persistent storage medium, wherein the flush thread is the thread corresponding to the target file set, and the flush interface is the interface, provided by the target data storage system, that corresponds to the target file set.
  6. The data processing method according to claim 5, wherein successfully writing the data encapsulation package written into the target buffer to the target file set corresponding to the target buffer comprises: determining the amount of data to be written in the data encapsulation package written into the target buffer; updating, based on that amount, the written position of the target file set corresponding to the target buffer; and, once the update completes, determining that the data encapsulation package written into the target buffer has been successfully written to the target file set corresponding to the target buffer, wherein the written position records the amount of data to be written that has been written to the target file set, and is incremented by the amount of data successfully written to the target file set.
  7. The data processing method according to claim 6, wherein calling, by the flush thread, the flush interface to store the data encapsulation package successfully written to the target file set in the persistent storage medium comprises: monitoring the written position and a triggered flush position of the target file set, wherein the triggered flush position records the amount of data to be written that has been flushed to the persistent storage medium; and, if it is determined from the written position and the triggered flush position that the data encapsulation package successfully written to the target file set has not been stored in the persistent storage medium, calling the flush interface to store that data encapsulation package in the persistent storage medium.
  8. The data processing method according to any one of claims 5-7, wherein, after storing the data encapsulation package successfully written to the target file set in the persistent storage medium, the method further comprises: deleting, based on the callback function corresponding to the data encapsulation package persisted in the persistent storage medium, that data encapsulation package from the target buffer corresponding to the target file set.
  9. The data processing method according to any one of claims 4-8, further comprising: if the target file meets a preset switching condition, creating a new target file in the target file set corresponding to the target file, and updating the state of the target file from open to closed to obtain a physical file in the closed state, wherein the preset switching condition is that the amount of data in the target file is greater than a preset threshold or that the time elapsed since the target file was created exceeds a preset time, and the closed state indicates that data may no longer be written to the target file; and, if the physical file in the closed state meets a preset recycling condition, recycling the file data in the physical file in the closed state.
  10. The data processing method according to claim 9, wherein recycling the file data in the physical file in the closed state, if the physical file in the closed state meets the preset recycling condition, comprises: if it is determined that the physical file in the closed state meets the preset recycling condition, creating a new target file in the target file set corresponding to that physical file, and determining the physical file in the closed state as a file to be recycled; and writing the non-junk data in the file to be recycled into the new target file, and replacing the file to be recycled with the new target file, wherein the non-junk data is the file data in the file to be recycled other than junk data, and the junk data is file data that has been deleted.
  11. The data processing method according to any one of claims 4-10, wherein writing the data of the data encapsulation package in the target buffer to the target file in the target file set corresponding to the target buffer, using the write thread corresponding to the target file set, comprises: writing the data encapsulation package written into the target buffer to the target file in the target file set corresponding to the target buffer, using the write thread corresponding to the target file in the target file set; and, if the write thread corresponding to the target file in the target file set encounters an error, creating a new target file in the target file set, and writing the data encapsulation package in the target buffer corresponding to the target file set and the new target file to the new target file, using the write thread corresponding to the new target file.
  12. The data processing method according to any one of claims 2-11, further comprising: writing the data to be written corresponding to the virtual file into a memory buffer and determining it as mirror data; if the mirror data in the memory buffer exceeds a preset threshold, rearranging the mirror data to obtain rearranged data; and, when the target file is updated to the closed state, replacing the data to be written in the data encapsulation packages written in the target file with the rearranged data.
  13. The data processing method according to any one of claims 1-12, wherein the multiple data processing requests are multiple data read requests and the data processing content is data information of the data to be read; accordingly, executing the multiple data processing requests according to the data processing content, the virtual file, and the target file set corresponding to the virtual file comprises: determining, based on the data information of the data to be read, the target data encapsulation package information corresponding to the data to be read in the virtual file; determining the target file set corresponding to the virtual file, and determining, from that target file set, the physical file corresponding to the virtual file; reading the target data encapsulation package corresponding to the target data encapsulation package information from the physical file corresponding to the virtual file; and reading the data to be read from the target data encapsulation package based on the data information of the data to be read.
  14. The data processing method according to claim 13, wherein the target data encapsulation package information includes the virtual file corresponding to the data encapsulation package, the data offset corresponding to the virtual file, and the data offset of the target file corresponding to the virtual file.
  15. The data processing method according to claim 14, wherein reading the target data encapsulation package corresponding to the target data encapsulation package information from the target file corresponding to the virtual file comprises: determining, based on the target data encapsulation package information, the target file corresponding to the target data encapsulation package and the data offset of the target file; and reading, based on the data offset of the target file, the target data encapsulation package at that data offset from the target file corresponding to the virtual file.
  16. A data processing method, applied to a log file scenario, comprising: receiving multiple log data processing requests, wherein each of the multiple log data processing requests carries a corresponding virtual log file identifier and log data processing content; determining, based on the virtual log file identifier, the virtual log file corresponding to the log data processing request and the target log file set corresponding to the virtual log file, wherein the target log file set is a set of physical log files in a target data storage system, at least two virtual log files correspond to the same target log file set, and one target log file set corresponds to one file handle; and opening the target log file set using the file handle of the target log file set, and executing the multiple log data processing requests according to the log data processing content, the virtual log file, and the target log file set corresponding to the virtual log file.
  17. A data storage system, comprising: a receiving unit configured to receive multiple data processing requests, wherein each of the multiple data processing requests carries a corresponding virtual file identifier and data processing content; a determining unit configured to determine, based on the virtual file identifier, the virtual file corresponding to the data processing request and the target file set corresponding to the virtual file, wherein the target file set is a set of physical files in a target data storage system, at least two virtual files among the multiple virtual files correspond to the same target file set, and one target file set corresponds to one file handle; and an execution unit configured to open the target file set using the file handle of the target file set, and execute the multiple data processing requests according to the data processing content, the virtual file, and the target file set corresponding to the virtual file.
  18. A data processing apparatus, comprising: a receiving module configured to receive multiple data processing requests, wherein each of the multiple data processing requests carries a corresponding virtual file identifier and data processing content; a determination module configured to determine, based on the virtual file identifier, the virtual file corresponding to the data processing request and the target file set corresponding to the virtual file, wherein the target file set includes at least one physical file in a target data storage system, at least two virtual files among the multiple virtual files correspond to the same target file set, and one target file set corresponds to one file handle; and an execution module configured to open the target file set using the file handle of the target file set, and execute the multiple data processing requests according to the data processing content, the virtual file, and the target file set corresponding to the virtual file.
  19. A computing device, comprising a memory and a processor, wherein the memory is configured to store a computer program/instructions and the processor is configured to execute the computer program/instructions, which, when executed by the processor, implement the steps of the data processing method according to any one of claims 1 to 15.
  20. A computer-readable storage medium storing a computer program/instructions that, when executed by a processor, implement the steps of the data processing method according to any one of claims 1 to 15.
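The write, flush, and read steps recited in the method claims can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: the class `TargetFile`, its method names, the 16-byte header layout, and the use of `os.fsync` as the flush interface are all hypothetical choices. It shows encapsulation packages from several virtual files being buffered, appended to one physical file, tracked with a written position and a flush position, and read back by physical offset.

```python
import os
import struct
import tempfile

# Assumed package header: virtual file id, offset within the virtual file, payload length.
HEADER = struct.Struct("<IQI")


class TargetFile:
    """Hypothetical open physical file of a target file set, shared by many virtual files."""

    def __init__(self, path):
        self.path = path
        self.handle = open(path, "ab")
        self.buffer = []        # target buffer: [vid, virtual offset, package bytes, written?]
        self.written_pos = 0    # bytes handed to the file so far
        self.flushed_pos = 0    # bytes known to be on persistent storage
        self.index = {}         # (vid, virtual offset) -> (physical offset, payload length)
        self.virtual_size = {}  # vid -> bytes accepted for that virtual file

    def submit(self, vid, payload):
        """Encapsulate the payload and place the package in the target buffer."""
        voff = self.virtual_size.get(vid, 0)
        self.virtual_size[vid] = voff + len(payload)
        package = HEADER.pack(vid, voff, len(payload)) + payload
        self.buffer.append([vid, voff, package, False])
        return voff

    def write_buffered(self):
        """Append every not-yet-written package and advance the written position."""
        for entry in self.buffer:
            vid, voff, package, written = entry
            if written:
                continue
            self.handle.write(package)
            self.index[(vid, voff)] = (self.written_pos, len(package) - HEADER.size)
            self.written_pos += len(package)
            entry[3] = True

    def flush_if_needed(self):
        """Flush only when the written position is ahead of the flush position."""
        if self.flushed_pos >= self.written_pos:
            return False
        self.handle.flush()
        os.fsync(self.handle.fileno())
        self.flushed_pos = self.written_pos
        # Stands in for the per-package callback that frees buffer space.
        self.buffer = [e for e in self.buffer if not e[3]]
        return True

    def read(self, vid, voff):
        """Locate a package by its physical offset and return its payload."""
        phys, length = self.index[(vid, voff)]
        with open(self.path, "rb") as f:
            f.seek(phys)
            header = HEADER.unpack(f.read(HEADER.size))
            if header != (vid, voff, length):
                raise ValueError("index and package header disagree")
            return f.read(length)

    def close(self):
        self.handle.close()
```

In this sketch `read` is only reliable for packages that have already been flushed, since it reopens the file through a separate read handle. Because each package header carries the virtual file id and virtual offset, packages belonging to one virtual file can be picked out of the shared file without splitting it first.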

Description

Data processing method, system and apparatus

This disclosure claims priority to Chinese Patent Application No. 202411554935.4, filed with the China Patent Office on November 1, 2024, entitled “Data Processing Method, System and Apparatus”, the entire contents of which are incorporated herein by reference.

Technical Field

This disclosure relates to the field of file system technology, and in particular to a data processing method, system, and apparatus.

Background Technology

Distributed databases often shard tables into regions, with different shards distributed across different machines so that multiple machines can serve requests collaboratively. Even a single machine may host multiple shards. A large amount of concurrent file processing (concurrent writes and reads, each of which requires opening the file) can exhaust file handle resources. For example, HBase (an open-source, distributed storage system) uses HDFS (Hadoop Distributed File System) for log writing, and each time HDFS opens a file it consumes several TCP (Transmission Control Protocol) threads.

To address this issue, single-machine databases often write the WALs of multiple shards to a single file. (WAL, Write-Ahead Logging, is a logging mechanism that database management systems use to ensure data consistency and durability: when the database executes a transaction, all changes are first recorded in a WAL log file before the main data store is updated.) This reduces the consumption of file handle resources. However, during error recovery the log file must first be split by shard before data recovery can proceed, which both lengthens data recovery and adds complexity to the database itself. This method is therefore not well suited to solving the problem of wasted file handle resources during concurrent file writes.

Summary of the Invention

In view of this, the present disclosure provides two data processing methods.
One or more embodiments of the present disclosure also relate to a data processing system, a data processing apparatus, a computing device, a computer-readable storage medium, and a computer program product, to solve the technical defect in the prior art whereby concurrent processing of multiple files exhausts file handle resources.

According to a first aspect of the present disclosure, a data processing method is provided, comprising: receiving multiple data processing requests, wherein each of the multiple data processing requests carries a corresponding virtual file identifier and data processing content; determining, based on the virtual file identifier, the virtual file corresponding to the data processing request and the target file set corresponding to the virtual file, wherein the target file set is a set of physical files in the target data storage system, at least two virtual files among the multiple virtual files correspond to the same target file set, and one target file set corresponds to one file handle; and opening the target file set using the file handle of the target file set, and executing the multiple data processing requests according to the data processing content, the virtual file, and the target file set corresponding to the virtual file.

According to a second aspect of the present disclosure, a data processing method applied to a log file scenario is provided, comprising: receiving multiple log data processing requests, wherein each of the multiple log data processing requests carries a corresponding virtual log file identifier and log data processing content; determining, based on the virtual log file identifier, the virtual log file corresponding to the log data processing request and the target log file set corresponding to the virtual log file. The target log file set is a set of physical log files in the target data storage system, and at least two virtual log files correspond to the same target log file set.
One target log file set corresponds to one file handle. The target log file set is opened using the file handle of the target log file set, and the multiple log data processing requests are executed according to the log data processing content, the virtual log file, and the target log file set corresponding to the virtual log file.

According to a third aspect of the present disclosure, a data storage system is provided, comprising: a receiving unit configured to receive multiple data processing requests, wherein each of the multiple data processing requests carries a corresponding virtual file identifier and data processing content; a determining unit configured to determine, based on the virtual file identifier, the virtual file corresponding to the data processing request and the target file set corresponding to the virtual file, wherein the target file set is a set of physical files in the target data storage system, and at least two virtual files in the multiple virtual files correspond to the same target file set, and one