CN-122019162-A - Data processing method and device
Abstract
Embodiments of the application provide a data processing method, a data processing apparatus, a computer device, a computer-readable storage medium, and a computer program product, and belong to the technical field of distributed computing. The data processing method comprises: when multi-modal data is generated by processing with a target distributed computing framework and corresponds to a target column of a columnar storage system, updating the multi-modal data into the target column; when the multi-modal data does not correspond to any column of the columnar storage system, storing the multi-modal data into a newly added column of the columnar storage system; and when target multi-modal data needs to be read, reading the target multi-modal data from the columnar storage system with the target distributed computing framework, processing it, and writing the processed result back into the columnar storage system. The technical solution of the embodiments can reduce the data inconsistency, disordered management, and low retrieval efficiency caused by storing multi-modal data on different storage media, and improves data usage efficiency and resource utilization.
Inventors
- ZHENG ZHISHENG
Assignees
- 上海哔哩哔哩科技有限公司 (Shanghai Bilibili Technology Co., Ltd.)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-01-28
Claims (13)
- 1. A data processing method, the method comprising: updating multi-modal data into a target column of a columnar storage system in the case where the multi-modal data is generated by processing with a target distributed computing framework and corresponds to the target column, wherein the columnar storage system supports storage of heterogeneous data, and the target distributed computing framework supports stream processing and batch processing; storing the multi-modal data into a new column of the columnar storage system in the case where the multi-modal data does not correspond to any column of the columnar storage system; and in the case where target multi-modal data needs to be acquired, reading the target multi-modal data from the columnar storage system by using the target distributed computing framework, processing the target multi-modal data, and writing a processing result of the target multi-modal data into the columnar storage system.
- 2. The method according to claim 1, wherein, in the case where the target multi-modal data needs to be acquired, reading the target multi-modal data from the columnar storage system by using the target distributed computing framework, processing the target multi-modal data, and writing the processing result of the target multi-modal data into the columnar storage system comprises: in the case where the target multi-modal data needs to be acquired, reading a target data fragment from the columnar storage system by using the target distributed computing framework, wherein the target data fragment is one fragment of the target multi-modal data and comprises a target fragment ID; performing a target computing operation on the target data fragment according to the target fragment ID by using the target distributed computing framework to obtain an operation result; and merging the operation result based on the target fragment ID and writing the merged operation result into the columnar storage system.
- 3. The method according to claim 2, wherein performing the target computing operation on the target data fragment according to the target fragment ID by using the target distributed computing framework to obtain the operation result comprises: converting, by using the target distributed computing framework, first data of the target data fragment from a column format into second data in a dictionary data structure; and performing the target computing operation on the second data according to the target fragment ID by using the target distributed computing framework to obtain the operation result; correspondingly, merging the operation result based on the target fragment ID and writing the merged operation result into the columnar storage system comprises: converting the operation result back into third data in the column format by using the target distributed computing framework; and merging the third data based on the target fragment ID and writing the merged third data into the columnar storage system.
- 4. The method of claim 3, wherein performing the target computing operation on the second data according to the target fragment ID by using the target distributed computing framework comprises: invoking a pre-configured data processing pipeline by using the target distributed computing framework, wherein the data processing pipeline is configured in a data serialization language; and performing the target computing operation on the second data according to the target fragment ID through the data processing pipeline.
- 5. The method of claim 4, wherein the data processing pipeline comprises a standardized operator, the standardized operator comprising a generic operator and a custom operator, the generic operator being an operator pre-built into the target distributed computing framework, the custom operator being an operator configured by using an operator standardization interface of the target distributed computing framework, the configuration of the operator standardization interface comprising an operator initialization configuration, an operator input, operator processing logic, and an operator output, the operator processing logic comprising single-item processing and batch processing; correspondingly, invoking the pre-configured data processing pipeline by using the target distributed computing framework and performing the target computing operation on the second data according to the target fragment ID through the data processing pipeline comprises: acquiring the configuration of the standardized operator; invoking the standardized operator based on the configuration of the standardized operator; and performing the target computing operation on the second data according to the target fragment ID through the standardized operator.
- 6. The method of claim 4, wherein the configuration of the data processing pipeline comprises an execution engine type, an engine execution mode, a data processing mode, a data input configuration, an operator chaining configuration, and a data output configuration, the engine execution mode comprising a cluster mode and a local mode, the data processing mode comprising a batch processing mode and a stream processing mode; correspondingly, invoking the pre-configured data processing pipeline by using the target distributed computing framework comprises: acquiring the configuration of the data processing pipeline; and invoking the data processing pipeline based on the configuration of the data processing pipeline by using the target distributed computing framework.
- 7. The method of claim 6, wherein, in the case where the data processing mode is the stream processing mode, the configuration of the data processing pipeline further comprises breakpoint resumption, elastic scaling, queue capacity control, and block size control.
- 8. The method of claim 2, wherein reading the target data fragment from the columnar storage system by using the target distributed computing framework comprises: pre-reading the target data fragment from the columnar storage system by using a data loader; caching the target data fragment in a distributed object store of the target distributed computing framework; pulling, by using the data loader, the target data fragment from the distributed object store according to a preset data reading mode, and filling the target data fragment into a local cache queue; and acquiring the target data fragment from the local cache queue by using the target distributed computing framework.
- 9. The method of claim 8, wherein the configuration of the data loader comprises a read path in the columnar storage system, field columns to read, data filtering conditions, a total number of samples to read, a distributed training configuration, pre-read batch data, a data reading mode, data conversion, and hardware adaptation, the data reading mode comprising asynchronous reading and streaming reading; correspondingly, pre-reading the target data fragment from the columnar storage system by using the data loader comprises: acquiring the configuration of the data loader, and invoking the data loader to pre-read the target data fragment from the columnar storage system according to the configuration of the data loader; and pulling, by using the data loader, the target data fragment from the distributed object store according to the preset data reading mode comprises: determining the data reading mode according to the configuration of the data loader, and pulling the target data fragment from the distributed object store according to the configured data reading mode.
- 10. A data processing apparatus, the apparatus comprising: an updating module, configured to update multi-modal data into a target column of a columnar storage system in the case where the multi-modal data is generated by processing with a target distributed computing framework and corresponds to the target column, wherein the columnar storage system supports storage of heterogeneous data, and the target distributed computing framework supports stream processing and batch processing; a storage module, configured to store the multi-modal data into a new column of the columnar storage system in the case where the multi-modal data does not correspond to any column of the columnar storage system; and a processing module, configured to, in the case where target multi-modal data needs to be acquired, read the target multi-modal data from the columnar storage system by using the target distributed computing framework, process the target multi-modal data, and write a processing result of the target multi-modal data into the columnar storage system.
- 11. A computer device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 9.
- 12. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein computer instructions which, when executed by a processor, implement the method of any of claims 1 to 9.
- 13. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 9.
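The read path recited in claim 8 (pre-read, cache in an object store, pull into a local cache queue, consume) can be illustrated with a minimal single-machine sketch. Everything here is an assumption for illustration only: the dictionaries `columnar_storage` and `object_store` are hypothetical stand-ins for the columnar storage system and the framework's distributed object store, and a background thread with a bounded `queue.Queue` stands in for the data loader. The claims do not prescribe this implementation.

```python
import queue
import threading
from typing import Dict, Iterator, List, Tuple

# Hypothetical stand-ins: fragment ID -> column values.
columnar_storage: Dict[int, List[str]] = {0: ["a", "b"], 1: ["c"], 2: ["d", "e"]}
object_store: Dict[int, List[str]] = {}

_DONE = object()  # sentinel marking the end of the fragment stream


def prefetch(fragment_ids: List[int], local_queue: "queue.Queue") -> None:
    """Loader side: pre-read, cache, then fill the local cache queue."""
    for fid in fragment_ids:
        # Pre-read the fragment and cache it in the (toy) object store.
        object_store[fid] = columnar_storage[fid]
        # Pull it from the object store into the bounded local queue;
        # put() blocks while the queue is full, which limits read-ahead.
        local_queue.put((fid, object_store[fid]))
    local_queue.put(_DONE)


def iter_fragments(
    fragment_ids: List[int], capacity: int = 2
) -> Iterator[Tuple[int, List[str]]]:
    """Consumer side: take fragments from the local cache queue in order."""
    local_queue: "queue.Queue" = queue.Queue(maxsize=capacity)
    worker = threading.Thread(
        target=prefetch, args=(fragment_ids, local_queue), daemon=True
    )
    worker.start()
    while True:
        item = local_queue.get()
        if item is _DONE:
            break
        yield item
    worker.join()


fragments = list(iter_fragments([0, 1, 2]))
print(fragments)
```

The bounded queue is one simple way to model the "local cache queue": the loader runs ahead of the consumer by at most `capacity` fragments, so reading overlaps with computation without unbounded memory growth.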
Description
Data processing method and device
Technical Field
Embodiments of the present application relate to the field of distributed computing technology, and in particular to a data processing method, apparatus, computer device, computer-readable storage medium, and computer program product.
Background
With the development of large-model technology, services related to large models typically generate large amounts of multi-modal data. Because the formats of these multi-modal data differ, they are usually stored in different storage systems. However, storing multi-modal data in different storage systems easily leads to data inconsistency and confused data management, and any later use of the data requires fetching it from multiple storage media, which is inefficient. In addition, large models typically require extensive computation that exceeds what a single computing node can deliver, so distributed computing is needed to meet the requirement. With distributed computing, nodes that fetch data across storage media incur heavy network transmission overhead, and inconsistent data versions across nodes easily cause deviations in computation results, which makes the problems of data inconsistency, disordered data management, and low retrieval efficiency more pronounced. It should be noted that the foregoing is not necessarily prior art and is not intended to limit the scope of the present application.
Disclosure of Invention
Embodiments of the present application provide a data processing method, apparatus, computer device, computer-readable storage medium, and computer program product to solve or alleviate one or more of the technical problems set forth above.
An aspect of an embodiment of the present application provides a data processing method, including: updating multi-modal data into a target column of a columnar storage system in the case where the multi-modal data is generated by processing with a target distributed computing framework and corresponds to the target column, wherein the columnar storage system supports storage of heterogeneous data, and the target distributed computing framework supports stream processing and batch processing; storing the multi-modal data into a new column of the columnar storage system in the case where the multi-modal data does not correspond to any column of the columnar storage system; and in the case where target multi-modal data needs to be acquired, reading the target multi-modal data from the columnar storage system by using the target distributed computing framework, processing the target multi-modal data, and writing a processing result of the target multi-modal data into the columnar storage system.
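The two write branches of the method (update an existing target column, or add a new column when no column matches) can be sketched as follows. This is a toy in-memory model under stated assumptions, not the storage system of the application: `ToyColumnStore`, its `write` method, and the column names are hypothetical and exist only to illustrate the branching.

```python
from typing import Any, Dict, List, Optional


class ToyColumnStore:
    """Hypothetical in-memory stand-in for a columnar storage system."""

    def __init__(self, num_rows: int) -> None:
        self.num_rows = num_rows
        # Column name -> one value per row; values may be of any type,
        # loosely modelling storage of heterogeneous data.
        self.columns: Dict[str, List[Optional[Any]]] = {}

    def write(self, column: str, values: Dict[int, Any]) -> str:
        """Write values keyed by row index and report which branch ran."""
        if column in self.columns:
            # The data corresponds to an existing target column: update it.
            action = "updated"
        else:
            # No column matches: add a new column, padded with None.
            self.columns[column] = [None] * self.num_rows
            action = "added"
        for row, value in values.items():
            self.columns[column][row] = value
        return action


store = ToyColumnStore(num_rows=3)
print(store.write("caption", {0: "a cat", 1: "a dog"}))  # new column
print(store.write("caption", {2: "a bird"}))             # existing column
print(store.write("embedding", {0: [0.1, 0.2]}))         # new column, other type
```

Keeping every modality as a column of the same table is what lets later reads fetch text, vectors, and other data for a row from one place instead of several storage media.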
Optionally, in the case where the target multi-modal data needs to be acquired, reading the target multi-modal data from the columnar storage system by using the target distributed computing framework, processing the target multi-modal data, and writing the processing result of the target multi-modal data into the columnar storage system includes: in the case where the target multi-modal data needs to be acquired, reading a target data fragment from the columnar storage system by using the target distributed computing framework, wherein the target data fragment is one fragment of the target multi-modal data and comprises a target fragment ID; performing a target computing operation on the target data fragment according to the target fragment ID by using the target distributed computing framework to obtain an operation result; and merging the operation result based on the target fragment ID and writing the merged operation result into the columnar storage system.
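The fragment-level flow described above (read fragments, compute per fragment, merge by fragment ID, write back) can be sketched on a single machine. This is an illustrative sketch only: a `ThreadPoolExecutor` stands in for the workers of the distributed computing framework, the dictionaries `storage` and `output_columns` are hypothetical stand-ins for the columnar storage system, and `compute` is an arbitrary example operation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

# Hypothetical stand-in for stored data: fragment ID -> values of one column.
storage: Dict[int, List[str]] = {
    0: ["a cat", "a dog"],
    1: ["a bird"],
    2: ["a fish", "a horse", "a cow"],
}
output_columns: Dict[str, List[int]] = {}


def compute(fragment_id: int, values: List[str]) -> Tuple[int, List[int]]:
    """Example computing operation: length of each text value.

    The fragment ID is returned with the result so the merge step can
    restore the original order regardless of completion order.
    """
    return fragment_id, [len(v) for v in values]


def process_fragments(store: Dict[int, List[str]]) -> List[int]:
    # The thread pool stands in for the framework's distributed workers.
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(compute, fid, vals) for fid, vals in store.items()]
        results = dict(f.result() for f in futures)
    # Merge per-fragment results in fragment-ID order.
    merged: List[int] = []
    for fid in sorted(results):
        merged.extend(results[fid])
    return merged


text_length_column = process_fragments(storage)
# Write the merged result back as a column of the (toy) storage.
output_columns["text_length"] = text_length_column
print(text_length_column)
```

Carrying the fragment ID through the computation is the key design point: workers may finish in any order, but merging on the ID keeps the output column aligned with the rows it was derived from.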
Optionally, performing the target computing operation on the target data fragment according to the target fragment ID by using the target distributed computing framework to obtain the operation result includes: converting, by using the target distributed computing framework, first data of the target data fragment from a column format into second data in a dictionary data structure; and performing the target computing operation on the second data according to the target fragment ID by using the target distributed computing framework to obtain the operation result. Correspondingly, merging the operation result based on the target fragment ID and writing the merged operation result into the columnar storage system includes: converting the operation result back into third data in the column format by using the target distributed computing framework; and merging the third data based on the target fragment ID and writing the merged third data into the columnar storage system. Optionally, the performing, by using the target distributed computing fra