CN-119225646-B - Data processing method, electronic device and storage medium
Abstract
The application provides a data processing method, an electronic device, and a storage medium. In the method, a processing module receives a first request from a target application program; the first request is used for requesting to read data of a first file, and the file size of the first file is smaller than a preset value. The processing module determines whether a cache file of the first file exists on a cache disk of the client. If the cache file of the first file is determined to exist, the processing module determines whether the cache file is valid. If the cache file of the first file is determined to be valid, the processing module reads the cache file of the first file from the cache disk of the client to obtain the data of the first file, and sends the data of the first file to the target application program. The method improves the efficiency of reading data from a large number of small files.
Inventors
- Zhu Enshui
Assignees
- 中国联合网络通信集团有限公司
- 联通数字科技有限公司
- 联通云数据有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20240912
Claims (9)
- 1. A data processing method, applied to a client, the client including a processing module and a target application program, a client of a Lustre file system being deployed in the processing module, the method comprising: the processing module receives a first request from the target application program, wherein the first request is used for requesting to read data of a first file, and the file size of the first file is smaller than a preset value; the processing module determines whether a cache file of the first file exists on a cache disk of the client; if the cache file of the first file is determined to exist, the processing module acquires, from the metadata information of the first file, the latest modification time of the first file on the cache disk of the client and the latest modification time of the first file on the server; if the latest modification time of the first file on the cache disk of the client is consistent with the latest modification time of the first file on the server, the processing module determines that the cache file of the first file is valid; if the cache file of the first file is determined to be valid, the processing module reads the cache file of the first file from the cache disk of the client so as to acquire the data of the first file; the processing module sends the data of the first file to the target application program; the client further comprises a cache management module, and the method further comprises: the cache management module receives a third request from a cache management end, wherein the third request comprises file directory information, the file directory information comprises an identification of a file directory and identifications of a plurality of files under the file directory, and the third request is used for adding data of the plurality of files under the file directory to the cache disk of the client in advance; the cache management module acquires the data of the plurality of files under the file directory from the server according to the file directory information, and caches the data of the plurality of files to the cache disk of the client; the cache disk of the client comprises a memory disk and a Non-Volatile Memory Express (NVMe) disk, and the method further comprises: when the storage space of the memory disk is insufficient, the cache management module caches file data to the NVMe disk.
- 2. The method of claim 1, wherein the first request includes an identification of the first file, and wherein the processing module determining whether a cache file of the first file exists on a cache disk of the client comprises: the processing module searches for metadata information of the first file according to the identification of the first file; and if the metadata information of the first file comprises a cache path of the first file on the cache disk of the client, determining that the cache file of the first file exists on the cache disk of the client.
- 3. The method according to claim 1 or 2, wherein the processing module reading the cache file of the first file from the cache disk of the client to obtain the data of the first file comprises: the processing module acquires a cache path of the first file on the cache disk of the client from the metadata information of the first file; the processing module opens the cache file of the first file on the cache disk of the client according to the cache path; and the processing module acquires a handle of the cache file, and reads the cache file of the first file from the cache disk of the client according to the handle of the cache file so as to acquire the data of the first file.
- 4. The method according to any one of claims 1 to 3, wherein the processing module determining whether a cache file of the first file exists on a cache disk of the client comprises: the processing module determines whether the first request carries a first mark, wherein the first mark is used for indicating a write operation on the data of the first file; and if the first request does not carry the first mark, the processing module determines whether the cache file of the first file exists on the cache disk of the client.
- 5. The method according to any one of claims 1 to 4, further comprising: the processing module receives a second request from the target application program, wherein the second request is used for requesting to close the first file; the processing module determines whether the cache file of the first file on the cache disk of the client is open; and if the cache file of the first file on the cache disk of the client is determined to be open, the processing module deletes the handle of the cache file and closes the cache file.
- 6. The method according to any one of claims 1 to 5, further comprising: if a preset condition is met, the processing module acquires a file path of the first file on the server from the metadata information of the first file; the processing module acquires the data of the first file from the server according to the file path of the first file on the server; the preset condition includes at least one of the following: the cache file of the first file does not exist on the cache disk of the client; the cache file of the first file is invalid; the first request carries the first mark.
- 7. The method according to any one of claims 1 to 6, further comprising: when the processing module determines that the latest modification time of the first file on the cache disk of the client is inconsistent with the latest modification time of the first file on the server, the processing module sends indication information to the cache management module; the cache management module executes at least one of the following according to the indication information: deleting the cache file of the first file from the cache disk of the client; acquiring the latest data of the first file from the server, and caching the latest data of the first file to the cache disk of the client.
- 8. An electronic device comprising a processor and a memory communicatively coupled to the processor; the memory stores computer-executable instructions; the processor executes computer-executable instructions stored in the memory to implement the method of any one of claims 1 to 7.
- 9. A computer readable storage medium having stored therein computer executable instructions which when executed by a processor are adapted to carry out the method of any one of claims 1 to 7.
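The cache management steps recited in claims 1 and 7 (pre-loading a directory, spilling to the NVMe disk when the memory disk is full, and dropping a stale cache file) can be illustrated with a minimal Python sketch. This is not the patented implementation: the class names `CacheTier` and `CacheManager`, the byte-budget `capacity` field, and the use of a plain file copy in place of a Lustre read are all assumptions made for illustration.

```python
import shutil
from dataclasses import dataclass, field
from pathlib import Path
from typing import Iterable


@dataclass
class CacheTier:
    """One cache disk (hypothetical model)."""
    root: Path        # directory standing in for the disk's mount point
    capacity: int     # byte budget; an assumption of this sketch
    used: int = 0     # bytes currently occupied by cache files


@dataclass
class CacheManager:
    """Hypothetical cache management module with a memory disk and an NVMe disk."""
    memory_disk: CacheTier
    nvme_disk: CacheTier
    # file name -> (tier, cache path, size); stands in for the metadata information
    entries: dict = field(default_factory=dict)

    def _pick_tier(self, size: int) -> CacheTier:
        # Prefer the memory disk; use the NVMe disk when its space is insufficient.
        for tier in (self.memory_disk, self.nvme_disk):
            if tier.used + size <= tier.capacity:
                return tier
        raise OSError("no cache disk has enough free space")

    def cache_file(self, server_path: Path) -> Path:
        """Copy one file from the (simulated) server to a cache disk."""
        self.invalidate(server_path.name)      # drop any older cached copy first
        size = server_path.stat().st_size
        tier = self._pick_tier(size)
        target = tier.root / server_path.name
        shutil.copy2(server_path, target)      # copy2 also carries over the mtime
        tier.used += size
        self.entries[server_path.name] = (tier, target, size)
        return target

    def prefetch(self, directory: Path, file_names: Iterable[str]) -> None:
        """Handle a 'third request': cache the listed files of a directory in advance."""
        for name in file_names:
            self.cache_file(directory / name)

    def invalidate(self, name: str) -> None:
        """Delete the cache file of `name`, e.g. after a modification-time mismatch."""
        entry = self.entries.pop(name, None)
        if entry is None:
            return
        tier, target, size = entry
        target.unlink(missing_ok=True)
        tier.used -= size
```

In this sketch, refreshing a stale file is simply `cache_file` called again with the server path, which corresponds to the second option in claim 7.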
Description
Data processing method, electronic device and storage medium
Technical Field
The present application relates to the field of data reading, and in particular, to a data processing method, an electronic device, and a storage medium.
Background
Lustre is a high-performance distributed file system commonly used in large computing clusters and in high performance computing (High Performance Computing, HPC) environments. In the Lustre file system, metadata and data are stored and managed separately. Data is stored in multiple object storage targets (Object Storage Target, OST) of Lustre, which allows Lustre to store very large files and provide high-speed concurrent read-write performance.
However, in a scenario such as model training, data of a massive number of small files needs to be read. A small file is generally smaller than what a single I/O (Input/Output) access can carry, so frequently reading such files through I/O may waste network bandwidth resources and result in low transmission performance of the Lustre file system. Therefore, reading the data of a large number of small files is inefficient.
Disclosure of Invention
The application provides a data processing method, an electronic device, and a storage medium, which are used for solving the technical problem of low efficiency in reading the data of massive small files.
In a first aspect, the application provides a data processing method, comprising: a processing module receives a first request from a target application program, wherein the first request is used for requesting to read data of a first file; the processing module determines whether a cache file of the first file exists on a cache disk of the client; if the cache file of the first file is determined to exist, the processing module determines whether the cache file is valid; if the cache file of the first file is determined to be valid, the processing module reads the cache file of the first file from the cache disk of the client so as to acquire the data of the first file; the processing module sends the data of the first file to the target application program.
Optionally, in the above method, the first request includes an identification of the first file, and the processing module determining whether a cache file of the first file exists on the cache disk of the client includes: the processing module searches for metadata information of the first file according to the identification of the first file; if the metadata information of the first file includes a cache path of the first file on the cache disk of the client, determining that the cache file of the first file exists on the cache disk of the client.
Optionally, in the above method, the processing module determining whether the cache file of the first file is valid includes: the processing module acquires, from the metadata information of the first file, the latest modification time of the first file on the cache disk of the client and the latest modification time of the first file on the server; if the latest modification time of the first file on the cache disk of the client is consistent with the latest modification time of the first file on the server, the processing module determines that the cache file of the first file is valid.
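The lookup and validity check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's implementation: `FileMetadata`, `read_small_file`, and the `fetch_from_server` callback are hypothetical names, the metadata information is modelled as an in-memory dictionary keyed by the file identification, and falling back to the server when the check fails follows the preset conditions listed in claim 6.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class FileMetadata:
    """Hypothetical shape of the metadata information of one file."""
    server_path: str                      # file path on the server
    server_mtime: float                   # latest modification time on the server
    cache_path: Optional[str] = None      # present only if a cache file exists
    cache_mtime: Optional[float] = None   # latest modification time of the cache file


def read_small_file(
    file_id: str,
    metadata: Dict[str, FileMetadata],
    fetch_from_server: Callable[[str], bytes],
) -> bytes:
    """Serve a read request from the cache disk when the cache file is valid."""
    meta = metadata[file_id]

    # Step 1: a cache file exists if the metadata records a cache path.
    has_cache = meta.cache_path is not None

    # Step 2: the cache file is valid if both modification times are consistent.
    is_valid = has_cache and meta.cache_mtime == meta.server_mtime

    if is_valid:
        # Step 3: open the cache file by its cache path and read it via the handle.
        with open(meta.cache_path, "rb") as handle:
            return handle.read()

    # Otherwise read from the server (assumed callback standing in for a Lustre read).
    return fetch_from_server(meta.server_path)
```

Comparing recorded modification times keeps the check local to the metadata lookup, so a valid hit needs no data transfer from the server.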
Optionally, in the above method, the processing module reading the cache file of the first file from the cache disk of the client to obtain the data of the first file includes: the processing module acquires a cache path of the first file on the cache disk of the client from the metadata information of the first file; the processing module opens the cache file of the first file on the cache disk of the client according to the cache path; the processing module acquires a handle of the cache file, and reads the cache file of the first file from the cache disk of the client according to the handle of the cache file so as to acquire the data of the first file.
Optionally, in the above method, the processing module determining whether a cache file of the first file exists on the cache disk of the client includes: the processing module determines whether the first request carries a first mark, wherein the first mark is used for indicating a write operation on the data of the first file; if the first request is determined not to carry the first mark, the processing module determines whether the cache file of the first file exists on the cache disk of the client.
Optionally, in the above method, the method further comprises: the processing module receives a second request from the target application program, wherein the second request is used for requesting to close the first file; the processing module determines whether the cache file of the first file on the cache disk of the client is open; if the cache file of the first file on the cache disk of the client is determined to be open, the processing module deletes the handle of the cache file and closes the cache file.
Optionally, as in the above method, the method further