CN-121979964-A - Data loading method and device and electronic equipment

Abstract

The application relates to the technical field of data processing and provides a data loading method, a data loading device and electronic equipment. The data loading method comprises: extracting text question-answer pairs, source paths of visual data and meta-information of the visual data from a file of a first data set; serializing the text question-answer pairs into binary data to obtain text binary data; reading the corresponding visual data from the source path, and converting the read visual data and the meta-information of the visual data into binary data to obtain visual binary data; storing the text binary data and the visual binary data to obtain a second data set; reading the text binary data and the visual binary data in the second data set; deserializing the read text binary data into character strings; and converting the character strings and the visual binary data into a content format supported by a target framework. The method improves data loading efficiency.
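
The round trip described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the UTF-8 JSON text encoding, the length-prefixed header layout, the record field names (question, answer, path, meta) and all function names are assumptions made for the example.

```python
import json
import struct
from pathlib import Path


def read_source(path: str) -> bytes:
    """Read the visual data from its source path."""
    return Path(path).read_bytes()


def serialize_text(qa_pair: dict) -> bytes:
    """Serialize one text question-answer pair to bytes (UTF-8 JSON, an assumed encoding)."""
    return json.dumps(qa_pair, ensure_ascii=False).encode("utf-8")


def pack_visual(raw: bytes, meta: dict) -> bytes:
    """Put the meta-information in a length-prefixed header ahead of the raw visual bytes."""
    header = json.dumps(meta).encode("utf-8")
    return struct.pack("<I", len(header)) + header + raw


def unpack_visual(blob: bytes) -> tuple[dict, bytes]:
    """Inverse of pack_visual: recover the meta-information and the raw visual bytes."""
    (header_len,) = struct.unpack_from("<I", blob, 0)
    meta = json.loads(blob[4:4 + header_len].decode("utf-8"))
    return meta, blob[4 + header_len:]


def convert_record(record: dict, read_bytes=read_source) -> tuple[bytes, bytes]:
    """Turn one first-format record into (text binary data, visual binary data)."""
    qa_pair = {"question": record["question"], "answer": record["answer"]}
    raw = read_bytes(record["path"])
    return serialize_text(qa_pair), pack_visual(raw, record["meta"])


def load_sample(text_blob: bytes, visual_blob: bytes) -> dict:
    """Deserialize the text into a character string and assemble a sample dict."""
    text = text_blob.decode("utf-8")  # the character string named in the method
    meta, raw = unpack_visual(visual_blob)
    # The dict layout stands in for whatever content format the target framework expects.
    return {"messages": json.loads(text), "meta": meta, "visual": raw}
```

Keeping the meta-information in a fixed-position header means a reader can recover it without decoding the visual payload, which is one plausible reason for converting the two together.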

Inventors

  • CHEN CHAOFENG

Assignees

  • 深圳市优必选科技股份有限公司

Dates

Publication Date
2026-05-05
Application Date
2025-12-24

Claims (11)

  1. A data loading method, comprising: reading a file of a first data set, and extracting text question-answer pairs, a source path of visual data and meta-information of the visual data from the file of the first data set, wherein the first data set is a data set in a first format and comprises the visual data and the text question-answer pairs corresponding to the visual data; serializing the text question-answer pairs into binary data to obtain text binary data; reading the corresponding visual data from the source path, and converting the read visual data and the meta-information of the visual data into binary data to obtain visual binary data; storing the text binary data and the visual binary data to obtain a second data set, wherein the second data set is a data set in a second format, and the I/O copy overhead of the data set in the second format is smaller than that of the data set in the first format; and reading the text binary data and the visual binary data in the second data set, deserializing the read text binary data into character strings, and converting the character strings and the visual binary data into a content format supported by a target framework.
  2. The data loading method of claim 1, wherein storing the text binary data and the visual binary data to obtain a second data set comprises: determining a unique identification of the visual data; and storing the text binary data in correspondence with the unique identification of the visual data, and storing the unique identification of the visual data in correspondence with the visual binary data, to obtain the second data set; and wherein reading the text binary data and the visual binary data in the second data set comprises: reading the text binary data and the unique identification of the visual data in the second data set; and reading the visual binary data in the second data set according to the unique identification of the visual data.
  3. The data loading method of claim 1, wherein the visual data comprises video data, and deserializing the read text binary data into character strings and converting the character strings and the visual binary data into a content format supported by a target framework comprises: parsing meta-information of the video data from a designated part of the video binary data, wherein the designated part is a part for storing the meta-information; decoding the video binary data into a sequence of frames according to the meta-information of the video data; converting the sequence of frames into a video tensor; and converting the character strings and the video tensor into the content format supported by the target framework.
  4. The data loading method of claim 1, wherein reading the text binary data and the visual binary data in the second data set, deserializing the read text binary data into character strings, and converting the character strings and the visual binary data into a content format supported by a target framework comprises: reading the text binary data and the visual binary data in the second data set in batches in a streaming processing mode, and temporarily storing the read text binary data and visual binary data in a buffer; and reading the text binary data from the buffer and deserializing it into character strings, and converting the character strings and the visual binary data read from the buffer into the content format supported by the target framework.
  5. The data loading method according to any one of claims 1 to 4, wherein reading the text binary data and the visual binary data in the second data set, deserializing the read text binary data into character strings, and converting the character strings and the visual binary data into a content format supported by a target framework comprises: dividing a file in the second data set into a plurality of contiguous small blocks, wherein the file in the second data set comprises the text binary data and the visual binary data; and for each small block, using a corresponding subprocess to read the text binary data and the visual binary data of the small block, deserialize the read text binary data into character strings, and convert the character strings and the visual binary data into the content format supported by the target framework.
  6. The data loading method of claim 4, wherein the second data set includes a validation set in the second format, and reading the text binary data and the visual binary data in the second data set in batches in a streaming processing mode and temporarily storing the read text binary data and visual binary data in a buffer comprises: determining the number of samples in the validation set; and when the number of samples is greater than a preset number threshold, reading the text binary data and the visual binary data in the validation set in batches in the streaming processing mode, and temporarily storing the read text binary data and visual binary data in the buffer.
  7. The data loading method of claim 6, wherein reading the text binary data and the visual binary data in the second data set, deserializing the read text binary data into character strings, and converting the character strings and the visual binary data into a content format supported by a target framework comprises: when the number of samples is not greater than the preset number threshold, preprocessing the text binary data and the visual binary data in a non-streaming processing mode to obtain a serialization file, wherein the serialization file comprises contents in the content format supported by the target framework.
  8. A data loading device, comprising: a file reading module of a first data set, configured to read a file of the first data set and extract text question-answer pairs, a source path of visual data and meta-information of the visual data from the file of the first data set, wherein the first data set is a data set in a first format and comprises the visual data and the text question-answer pairs corresponding to the visual data; a text binary data determining module, configured to serialize the text question-answer pairs into binary data to obtain text binary data; a visual binary data determining module, configured to read the corresponding visual data from the source path, and convert the read visual data and the meta-information of the visual data into binary data to obtain visual binary data; a second data set determining module, configured to store the text binary data and the visual binary data to obtain a second data set, wherein the second data set is a data set in a second format, and the I/O copy overhead of the data set in the second format is smaller than that of the data set in the first format; and a file reading module of the second data set, configured to read the text binary data and the visual binary data in the second data set, deserialize the read text binary data into character strings, and convert the character strings and the visual binary data into a content format supported by a target framework.
  9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 7 when executing the computer program.
  10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 7.
  11. A computer program product comprising a computer program which, when run, causes the method of any one of claims 1 to 7 to be performed.
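
The reading side of claims 2, 4, 6 and 7 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: two in-memory dicts keyed by the unique identification stand in for the second data set, a bounded queue filled by a background thread stands in for the buffer, and the function names, batch size and buffer size are hypothetical. The non-streaming branch only materializes the samples in memory; writing them to a serialization file is omitted.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def iter_batches(text_store: dict, visual_store: dict, batch_size: int):
    """Read (text binary, visual binary) pairs in batches, joined by unique ID."""
    batch = []
    for uid, text_blob in text_store.items():
        batch.append((text_blob, visual_store[uid]))  # look up visual data by ID
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


def stream_samples(text_store, visual_store, batch_size=2, buffer_batches=4):
    """Stream samples; a bounded queue acts as the temporary buffer."""
    buf = queue.Queue(maxsize=buffer_batches)

    def producer():
        try:
            for batch in iter_batches(text_store, visual_store, batch_size):
                buf.put(batch)  # blocks while the buffer is full
        finally:
            buf.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while (batch := buf.get()) is not _DONE:
        for text_blob, visual_blob in batch:
            # Deserialize the text binary data into a character string.
            yield text_blob.decode("utf-8"), visual_blob


def load_validation_set(text_store, visual_store, threshold: int):
    """Stream when the sample count exceeds the threshold, else load everything up front."""
    if len(text_store) > threshold:
        return stream_samples(text_store, visual_store)
    return [(t.decode("utf-8"), visual_store[uid]) for uid, t in text_store.items()]
```

The bounded queue caps memory use: the reader thread stalls once the buffer holds buffer_batches batches and resumes as the consumer drains it.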

Description

Data loading method and device and electronic equipment

Technical Field

The present application relates to the field of data processing technology, and in particular to a data loading method, apparatus, electronic device, computer-readable storage medium, and computer program product.

Background

As artificial intelligence technology spreads into scenarios such as embodied intelligence, deep learning models (such as natural language processing models, computer vision models and multi-modal interaction models) place increasingly high demands on data loading efficiency. Data loading precedes model training, so its efficiency directly affects the whole training cycle. Especially when processing large-scale multi-modal data sets of images, videos, text and the like with tens of millions of samples, inefficient data loading may leave computing resources (such as graphics processing units (GPUs)) waiting, seriously affecting training efficiency and forming the core bottleneck of "compute idling while waiting for data".

In current model training practice, data is typically read using the data loading pipeline built into the deep learning framework. However, when the storage format of the training data incurs high input/output (I/O) copy overhead, a data loading bottleneck forms: the GPU or accelerator waits, hardware utilization drops, and overall training throughput is ultimately constrained.
For example, when the Scalable lightWeight Infrastructure for Fine-Tuning (abbreviated as "Swift") processes a large-scale data set, its main approach is to load text-format files such as JSON Lines (JSONL) files. Because JSONL is parsed as text line by line, the data has to be copied between disk and memory multiple times during reading, so the I/O potential of the hardware is difficult to exploit, and the requirement of large models for efficient loading of large-scale data sets cannot be met.

Disclosure of Invention

The embodiments of the application provide a data loading method, a data loading device and electronic equipment, which can solve the problem of low data loading efficiency when an existing target framework loads data.

In a first aspect, an embodiment of the present application provides a data loading method, including: reading a file of a first data set, and extracting text question-answer pairs, a source path of visual data and meta-information of the visual data from the file of the first data set, wherein the first data set is a data set in a first format and comprises the visual data and the text question-answer pairs corresponding to the visual data; serializing the text question-answer pairs into binary data to obtain text binary data; reading the corresponding visual data from the source path, and converting the read visual data and the meta-information of the visual data into binary data to obtain visual binary data; storing the text binary data and the visual binary data to obtain a second data set, wherein the second data set is a data set in a second format, and the I/O copy overhead of the data set in the second format is smaller than that of the data set in the first format; and reading the text binary data and the visual binary data in the second data set, deserializing the read text binary data into character strings, and
converting the character strings and the visual binary data into a content format supported by a target framework.

In a second aspect, an embodiment of the present application provides a data loading apparatus, including: a file reading module of a first data set, configured to read a file of the first data set and extract text question-answer pairs, a source path of visual data and meta-information of the visual data from the file of the first data set, wherein the first data set is a data set in a first format and comprises the visual data and the text question-answer pairs corresponding to the visual data; a text binary data determining module, configured to serialize the text question-answer pairs into binary data to obtain text binary data; a visual binary data determining module, configured to read the corresponding visual data from the source path, and convert the read visual data and the meta-information of the visual data into binary data to obtain visual binary data; a second data set determining module, configured to store the text binary data and the visual binary data to obtain a second data set, wherein the second data set is a data set in a second fo