
CN-122019128-A - Task processing method and device


Abstract

The application provides a task processing method and device, relating to the technical field of artificial intelligence. The task processing method comprises: splitting a target model and inserting a data synchronization operator to obtain a plurality of model blocks; loading the plurality of model blocks onto a plurality of accelerator cards, wherein the accelerator cards are pairwise interconnected; and, in response to an acquired task to be inferred, performing parallel computation on the task to be inferred based on the plurality of accelerator cards to obtain an inference result, wherein the computation comprises the computation corresponding to the model operator in the target model and the computation corresponding to the data synchronization operator. Through the cooperation of compilation, the inference service, and the accelerator cards' hardware and software, the application synchronizes computation and data with high precision and low latency, greatly reducing synchronization overhead and synchronization time during model inference while ensuring correctness.

Inventors

  • Xiong Xi
  • Cai Quanxiong
  • Niu Xinyu

Assignees

  • Shenzhen Corerain Technologies Co., Ltd. (深圳鲲云信息科技有限公司)

Dates

Publication Date
2026-05-12
Application Date
2025-12-22

Claims (11)

  1. A task processing method, comprising: splitting a target model and inserting a data synchronization operator to obtain a plurality of model blocks; loading the plurality of model blocks onto a plurality of accelerator cards, wherein the plurality of accelerator cards are pairwise interconnected; and, in response to an acquired task to be inferred, performing parallel computation on the task to be inferred based on the plurality of accelerator cards to obtain an inference result, wherein the computation comprises the computation corresponding to a model operator in the target model and the computation corresponding to the data synchronization operator.
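
A minimal, self-contained Python sketch of the claimed flow follows. Every name is a hypothetical stand-in (the patent specifies only the sequence of steps, not an API), and the per-card computation is reduced to a toy:

```python
# Toy sketch of claim 1: split -> load -> parallel compute. All classes and
# functions are hypothetical stand-ins, not the patent's implementation.

def split_and_insert_sync_ops(model_ops, num_blocks):
    # Step 1 stand-in: each "block" is the operator list plus a trailing
    # sync marker; a real splitter would shard the model's weights.
    return [list(model_ops) + ["sync"] for _ in range(num_blocks)]

class Card:
    def load(self, block):                 # step 2: one model block per card
        self.block = block
    def run(self, request):                # step 3: walk the block's operators
        result = request
        for op in self.block:
            if op != "sync":
                result += 1                # toy "model operator" computation
            # a real "sync" operator would exchange data with the other cards
        return result

cards = [Card() for _ in range(4)]         # pairwise-interconnected (assumed)
for card, block in zip(cards, split_and_insert_sync_ops(["matmul", "add"], 4)):
    card.load(block)
outputs = [card.run(0) for card in cards]  # conceptually parallel
assert len(set(outputs)) == 1              # all cards agree after the final sync
```
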
  2. The method of claim 1, wherein splitting the target model and inserting the data synchronization operator to obtain the plurality of model blocks comprises: performing tensor-parallel splitting on preset weight parameters of the target model to obtain a splitting result; and extracting, from the splitting result, the computation nodes that require data synchronization, and inserting data synchronization operators with the corresponding functions at the positions of those computation nodes to obtain the plurality of model blocks.
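
To make the tensor-parallel split concrete, here is a hedged NumPy sketch under one common assumption: a linear layer's weight matrix is split column-wise across the cards, so the point where the partial outputs must be recombined is exactly where a synchronization node (an AllGather, simulated here by concatenation) gets inserted. A row-wise split would instead require an AllReduce at that node.

```python
import numpy as np

# Column-wise tensor-parallel split of one weight matrix across N cards.
def split_linear_weight(W: np.ndarray, num_cards: int):
    return np.array_split(W, num_cards, axis=1)   # one shard per card

W = np.random.randn(8, 8).astype(np.float32)
x = np.random.randn(1, 8).astype(np.float32)
shards = split_linear_weight(W, num_cards=4)
partials = [x @ w for w in shards]            # per-card local computation
full = np.concatenate(partials, axis=1)       # the inserted AllGather node
assert np.allclose(full, x @ W, atol=1e-5)    # matches the unsplit layer
```
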
  3. The method of claim 1, wherein any two of the plurality of accelerator cards are communicatively coupled via a hardware interface.
  4. The method of claim 2, wherein, in response to the acquired task to be inferred, performing parallel computation on the task to be inferred based on the plurality of accelerator cards to obtain the inference result comprises: in response to the acquired task to be inferred, writing the task into the memories of the plurality of accelerator cards; and executing, based on the accelerator cards, the model operators and/or the data synchronization operators in the corresponding model blocks in parallel, exchanging data whenever a data synchronization operator is executed, until the operators in the corresponding model blocks have been traversed, and outputting the computation result as the inference result.
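
A hedged sketch of this two-stage procedure, with Python threads standing in for the per-card execution engines; the `Card` class and its methods are toy stand-ins, not the patent's interface:

```python
from concurrent.futures import ThreadPoolExecutor

class Card:
    """Toy accelerator card: staged input memory plus an operator list."""
    def __init__(self, ops):
        self.ops, self.memory = ops, None
    def write_to_memory(self, request):        # stage the task on-card
        self.memory = request
    def execute_block(self):                   # traverse the block's operators
        result = self.memory
        for op in self.ops:
            result = op(result)
        return result

def infer(cards, request):
    for card in cards:
        card.write_to_memory(request)          # write the task into every card
    with ThreadPoolExecutor(len(cards)) as pool:
        outs = list(pool.map(Card.execute_block, cards))
    return outs[0]                             # cards agree after the last sync

cards = [Card([lambda x: x * 2, lambda x: x + 1]) for _ in range(4)]
print(infer(cards, 3))                         # prints 7
```
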
  5. The method of claim 4, wherein executing, based on the plurality of accelerator cards, the model operators and/or the data synchronization operators in the corresponding model blocks in parallel, exchanging data when a data synchronization operator is executed until the operators in the corresponding model blocks have been traversed, and outputting the computation result as the inference result comprises: S1, extracting, based on the accelerator cards, a current operator from the operators in the corresponding model block; S2, when the current operator is a model operator, invoking the computing units built into the plurality of accelerator cards to complete the corresponding computation, obtaining a current computation result; S3, when the current operator is a data synchronization operator, invoking the synchronization computing units built into the plurality of accelerator cards to complete the corresponding computation, obtaining a current computation result; and S4, repeating steps S1-S3 until the accelerator cards have traversed the operators in the corresponding model blocks, and outputting the current computation result of the last iteration as the inference result.
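
Steps S1-S4 amount to a per-card interpreter loop over the operators of that card's model block. A minimal sketch, assuming (hypothetically) that each operator is tagged as either a model operator or a data synchronization operator:

```python
# Interpreter loop for S1-S4: operators are (kind, fn) pairs; the compute and
# synchronization "units" are reduced to plain function calls for clarity.
def execute_block(operators, value):
    result = value
    for kind, fn in operators:       # S1: extract the current operator
        if kind == "model":          # S2: dispatch to the built-in compute unit
            result = fn(result)
        else:                        # S3: dispatch to the synchronization unit
            result = fn(result)      #     (AllGather/AllReduce across cards)
    return result                    # S4: last result is the inference output

ops = [("model", lambda x: x * 2), ("sync", lambda x: x), ("model", lambda x: x + 1)]
assert execute_block(ops, 3) == 7
```
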
  6. The method of claim 5, wherein step S3 comprises: S301, when the current operator is a first data synchronization operator, acquiring the first data stored by each of the plurality of accelerator cards as current data, and allocating memory space for data from the other accelerator cards based on the remaining memory space of each accelerator card to obtain an allocation result, wherein the first data synchronization operator comprises an AllGather operator; S302, invoking the synchronization computing units built into the plurality of accelerator cards, and reading the current data of the corresponding accelerator card according to a pre-configured accelerator-card data-reading relationship, as a first computation result of the plurality of accelerator cards; S303, updating the current data of the plurality of accelerator cards according to the first computation result, and storing the first computation result to the corresponding position according to the allocation result; and S304, repeating steps S302-S303 until the data in the plurality of accelerator cards are completely consistent, and outputting the consistent data as the current computation result.
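
Simulated with plain arrays, the AllGather of steps S301-S304 looks as follows. The "pre-configured data-reading relationship" is assumed here to be a ring (each card reads its left neighbour); the patent does not fix the topology. After N-1 rounds every card holds identical, fully gathered data:

```python
import numpy as np

# Ring AllGather over N simulated "cards": each round, card i stores the
# chunk read from card i-1, so after N-1 rounds all cards hold every chunk.
def ring_all_gather(chunks):
    n = len(chunks)
    gathered = [{i: chunks[i]} for i in range(n)]   # allocated storage (S301)
    current, cur_idx = list(chunks), list(range(n))
    for _ in range(n - 1):
        current = [current[(i - 1) % n] for i in range(n)]   # read peer (S302)
        cur_idx = [cur_idx[(i - 1) % n] for i in range(n)]
        for i in range(n):
            gathered[i][cur_idx[i]] = current[i]             # store (S303)
    return [np.concatenate([g[k] for k in sorted(g)]) for g in gathered]

out = ring_all_gather([np.full(2, i, np.float32) for i in range(4)])
assert all(np.array_equal(out[0], o) for o in out)   # consistent data (S304)
```
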
  7. The method of claim 5, wherein step S3 comprises: S311, when the current operator is a second data synchronization operator, acquiring the second data stored by each of the plurality of accelerator cards as current data, and allocating memory space for data from the other accelerator cards based on the remaining memory space of each accelerator card to obtain an allocation result, wherein the second data synchronization operator comprises an AllReduce operator; S312, invoking the synchronization computing units built into the plurality of accelerator cards, reading the current data of the corresponding accelerator card according to a pre-configured accelerator-card data-reading relationship, and executing the computation corresponding to the second data synchronization operator on the read current data and the second data stored by each of the plurality of accelerator cards to obtain a second computation result; S313, updating the current data of the plurality of accelerator cards according to the second computation result, and storing the second computation result to the corresponding position according to the allocation result; and S314, repeating steps S312-S313 until the data in the accelerator cards are completely consistent, and outputting the second computation result of the last iteration as the current computation result.
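
The AllReduce of steps S311-S314 can be simulated in the same style: each card keeps an accumulator seeded with its own stored data, while a "current data" buffer rotates one hop per round (the ring read relationship is again an assumption) and is combined in; after N-1 rounds every card holds the global sum:

```python
import numpy as np

# Ring AllReduce (sum) over N simulated "cards".
def ring_all_reduce(values):
    n = len(values)
    acc = [v.copy() for v in values]      # each card's stored data (S311)
    current = [v.copy() for v in values]  # each card's rotating current data
    for _ in range(n - 1):
        current = [current[(i - 1) % n] for i in range(n)]  # read peer (S312)
        acc = [a + c for a, c in zip(acc, current)]         # combine, update (S313)
    return acc

vals = [np.full(3, i + 1, np.float32) for i in range(4)]    # cards hold 1,2,3,4
out = ring_all_reduce(vals)
assert all(np.array_equal(o, np.full(3, 10, np.float32)) for o in out)  # S314
```
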
  8. The method of claim 5, wherein step S3 comprises: S321, when the current operator is a third data synchronization operator, dividing the third data stored by each of the plurality of accelerator cards into a first number of data blocks and numbering the data blocks, wherein the first number is the count of the plurality of accelerator cards, and the third data synchronization operator comprises an AllReduce+Scatter operator; S322, taking the data block with a first target number stored in each of the plurality of accelerator cards as current data, wherein the first target number is the serial number of the respective accelerator card; S323, invoking the synchronization computing units built into the plurality of accelerator cards, and reading the current data in the corresponding accelerator card according to a pre-configured accelerator-card data-reading relationship, as second data to be processed; S324, acquiring the data blocks in the plurality of accelerator cards with the same numbers as the current data, as first data to be processed; S325, executing the computation corresponding to the third data synchronization operator on the first data to be processed and the second data to be processed to obtain a third computation result, and updating the current data of the plurality of accelerator cards according to the third computation result; S326, storing the third computation result to the original storage position of the first data to be processed, and repeating steps S322-S326 until the third computation result meets a preset condition, wherein the preset condition comprises that the accumulation of the identically numbered data blocks across the plurality of accelerator cards is complete; S327, invoking the synchronization computing units built into the plurality of accelerator cards, and reading, according to the accelerator-card data-reading relationship, the latest data meeting the preset condition in the corresponding accelerator card, as third data to be processed; S328, storing the third data to be processed to target storage positions of the plurality of accelerator cards, wherein the target storage positions are the storage positions, in the memory spaces of the plurality of accelerator cards, of the data blocks with the same number as the third data to be processed; and S329, repeating steps S327-S328 until the data in the plurality of accelerator cards are completely consistent, and outputting the consistent data as the current computation result.
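
Claim 8 describes what is conventionally called ring all-reduce: a reduce-scatter phase (S321-S326) leaves each card holding the fully accumulated version of one numbered block, and an all-gather phase (S327-S329) then circulates those blocks until every card holds all of them. A NumPy simulation, with the ring read relationship once more assumed:

```python
import numpy as np

# Two-phase ring AllReduce: reduce-scatter then all-gather, N blocks per card.
def ring_reduce_scatter_all_gather(data):
    n = len(data)
    blocks = [list(np.array_split(d, n)) for d in data]  # number the blocks (S321)
    for r in range(n - 1):                               # reduce-scatter rounds
        sent = [blocks[i][(i - r) % n] for i in range(n)]
        for i in range(n):
            j = (i - r - 1) % n                          # block being accumulated
            blocks[i][j] = blocks[i][j] + sent[(i - 1) % n]   # S323-S326
    # Now card i holds the fully summed block (i + 1) % n; circulate it (S327+).
    for r in range(n - 1):                               # all-gather rounds
        sent = [blocks[i][(i - r + 1) % n] for i in range(n)]
        for i in range(n):
            blocks[i][(i - r) % n] = sent[(i - 1) % n]   # S328
    return [np.concatenate(b) for b in blocks]           # consistent data (S329)

data = [np.arange(4, dtype=np.float32) + i for i in range(4)]
out = ring_reduce_scatter_all_gather(data)
assert all(np.array_equal(o, np.sum(np.stack(data), axis=0)) for o in out)
```
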
  9. A task processing device, comprising: a splitting module configured to split a target model and insert a data synchronization operator to obtain a plurality of model blocks; a loading module configured to load the plurality of model blocks onto a plurality of accelerator cards, wherein the plurality of accelerator cards are pairwise interconnected; and a computing module configured to, in response to an acquired task to be inferred, perform parallel computation on the task to be inferred based on the accelerator cards to obtain an inference result, wherein the computation comprises the computation corresponding to a model operator in the target model and the computation corresponding to the data synchronization operator.
  10. An electronic device, comprising: one or more processors; and a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-8.
  11. A computer-readable storage medium having stored thereon a computer program or instructions which, when executed by a processor, implement the method of any one of claims 1-8.

Description

Task processing method and device

Technical Field

The application relates to the technical field of artificial intelligence, and in particular to a task processing method and device.

Background

With the ongoing development of large AI models and the continual growth of model scale, a single inference accelerator card can no longer hold an entire model, and cooperative inference of one model across multiple accelerator cards has become the norm. Since a single inference request may involve multiple, diverse data synchronizations (AllReduce/Broadcast/AllGather), the performance of data collaboration among the cards is critical. However, communication latency between accelerator cards, clock misalignment, delays introduced by the software stack, and the like give rise to desynchronized clocks and inconsistent computation rhythms, which in turn cause data synchronization deviations, excessive synchronization waiting times, and low computational efficiency. Traditional approaches comprise software synchronization and hardware synchronization, but both have drawbacks: software synchronization offers low precision (millisecond level) and high latency and cannot meet the requirements of AI inference, while traditional hardware synchronization requires dedicated hardware support, has a complex architecture and high cost, and provides only simple synchronization functions.

Disclosure of the Invention

In view of this, the application provides a task processing method and a task processing device that achieve data synchronization with high precision, low cost, and low latency. According to one aspect of the application, a task processing method is provided, comprising: splitting a target model and inserting data synchronization operators to obtain a plurality of model blocks; loading the plurality of model blocks onto a plurality of accelerator cards, wherein the accelerator cards are pairwise interconnected; and, in response to an acquired task to be inferred, performing parallel computation on the task to be inferred based on the plurality of accelerator cards to obtain an inference result, wherein the computation comprises the computation corresponding to the model operators in the target model and the computation corresponding to the data synchronization operators. According to some embodiments, splitting the target model and inserting the data synchronization operators to obtain the plurality of model blocks comprises: performing tensor-parallel splitting on preset weight parameters of the target model to obtain a splitting result; extracting, from the splitting result, the computation nodes that require data synchronization; and inserting data synchronization operators with the corresponding functions at the positions of those computation nodes to obtain the plurality of model blocks. According to some embodiments, any two of the plurality of accelerator cards are communicatively coupled via a hardware interface.
According to some embodiments, in response to the acquired task to be inferred, performing parallel computation on the task to be inferred based on the plurality of accelerator cards to obtain the inference result comprises: in response to the acquired task to be inferred, writing the task into the memories of the accelerator cards; and executing, based on the accelerator cards, the model operators and/or the data synchronization operators in the corresponding model blocks in parallel, exchanging data whenever a data synchronization operator is executed, until the operators in the corresponding model blocks have been traversed, and outputting the computation result as the inference result. According to some embodiments, executing the model operators and/or the data synchronization operators in the corresponding model blocks in parallel based on the plurality of accelerator cards, exchanging data when a data synchronization operator is executed until the operators in the corresponding model blocks have been traversed, and outputting the computation result as the inference result comprises the following steps: S1, extracting, based on the plurality of accelerator cards, a current operator from the operators in the corresponding model block; S2, when the current operator is a model operator, invoking the computing units built into the plurality of accelerator cards to complete the corresponding computation, obtaining a current computation result; S3, when the current operator is a data synchronization operator, invoking the synchronization computing units built into the plurality of accelerator cards to complete the corresponding computation, obtaining the current computation result; and S4, repeating S1-S3 until the plurality of accelerator cards have traversed the operators in the corresponding model blocks, and outputting the current computation result of the last iteration as the inference result. According to some embodiments, step S3 comprises the ste