
CN-115713101-B - Parallel processing method and device

CN115713101B

Abstract

The embodiments of the application disclose a parallel processing method and device in the field of computer technology. A weight is determined for each of a plurality of computing cores according to that core's computing power, and the data to be processed is distributed to the cores based on these weights rather than divided equally according to the number of cores. When the cores' computing power differs, weight-based distribution keeps the load balanced, improving the efficiency of parallel processing and the overall performance.

Inventors

  • Feng Renguang
  • Xu Jianchang
  • Wang Xianli

Assignees

  • Hangzhou Hikvision Digital Technology Co., Ltd. (杭州海康威视数字技术股份有限公司)

Dates

Publication Date
2026-05-12
Application Date
2021-08-19

Claims (9)

  1. A parallel processing method, the method comprising: determining a weight for each computing core of a plurality of computing cores according to the computing power of that core; determining the total number of data units included in the data to be processed; determining a first processing number for each computing core based on the weight of each core, the total number, and a specified number, wherein the specified number is determined according to the memory-access number of the computing cores, serves as the minimum allocation unit when determining the number of data units each core is to process, and the first processing number is an integer multiple of the specified number; determining a remaining number based on the total number and the first processing number of each core, the remaining number being less than the product of the core count and the specified number, the core count being the total number of the plurality of computing cores; determining, according to the weight of each core, a second processing number for each of m computing cores based on the remaining number, wherein the weight of any of the m cores is not lower than the weight of any core outside the m cores, and m is less than or equal to the total number of the plurality of computing cores; determining the number of data units each core is to process from its first processing number and, for the m cores, its second processing number; dividing the data to be processed into a plurality of data blocks based on the number of data units each core is to process, the data blocks being in one-to-one correspondence with the computing cores; assigning each data block to its corresponding computing core; and processing the data to be processed in parallel through the plurality of computing cores.
  2. The method of claim 1, wherein the data to be processed comprises input data of one or more operators in a convolutional neural network, different operators corresponding to different data allocation policies; and determining the total number of data units included in the data to be processed comprises: determining the total number of data units included in the input data of the one or more operators according to the data allocation policy corresponding to each operator.
  3. The method of claim 2, wherein the one or more operators comprise a general matrix multiplication (GEMM) convolution operator whose input data comprises a first matrix and a second matrix; and determining the total number of data units according to the data allocation policy corresponding to each operator comprises: taking the number of rows of the first matrix or the number of columns of the second matrix as the total number, wherein the number of rows of the first matrix indicates the size of the feature map output by the GEMM convolution operator and the number of columns of the second matrix indicates the number of channels of that feature map.
  4. The method of claim 2, wherein the one or more operators comprise a pooling operator whose input data comprises a multi-channel feature map to be processed; and determining the total number of data units according to the data allocation policy corresponding to each operator comprises: determining the batch count of the multi-channel feature map as the total number; or determining the channel count of the multi-channel feature map as the total number; or determining the total number from both the batch count and the channel count.
  5. The method of claim 2, wherein the one or more operators comprise an element-wise processing operator whose input data comprises a plurality of elements; and determining the total number of data units according to the data allocation policy corresponding to each operator comprises: determining the number of the plurality of elements as the total number.
  6. The method according to any one of claims 1-5, wherein determining, according to the weight of each computing core, the second processing number for the m computing cores based on the remaining number comprises: determining m second processing numbers based on the specified number and the remaining number; selecting the m computing cores from the plurality of computing cores in descending order of weight; and assigning the m second processing numbers to the m computing cores.
  7. A parallel processing apparatus, the apparatus comprising: a determining module configured to determine a weight for each computing core of a plurality of computing cores according to the computing power of that core; an allocation module configured to allocate data to be processed to the plurality of computing cores based on the weights of the cores; and a processing module configured to process the data to be processed in parallel through the plurality of computing cores; wherein the allocation module comprises: a first determining submodule configured to determine the total number of data units included in the data to be processed; a second determining submodule configured to determine, based on the weight of each core and the total number, the number of data units each core is to process, and to divide the data to be processed into a plurality of data blocks based on that number, the data blocks being in one-to-one correspondence with the computing cores; and an allocation submodule configured to allocate each data block to its corresponding computing core; wherein the second determining submodule is specifically configured to: determine a first processing number for each computing core based on the weight of each core, the total number, and a specified number, wherein the specified number is determined according to the memory-access number of the computing cores, serves as the minimum allocation unit when determining the number of data units each core is to process, and the first processing number is an integer multiple of the specified number; determine a remaining number based on the total number and the first processing number of each core, the remaining number being less than the product of the core count and the specified number, the core count being the total number of the plurality of computing cores; determine, according to the weight of each core, a second processing number for each of m computing cores based on the remaining number, wherein the weight of any of the m cores is not lower than the weight of any core outside the m cores, and m is less than or equal to the total number of the plurality of computing cores; and determine the number of data units each core is to process from its first processing number and, for the m cores, its second processing number.
  8. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method according to any one of claims 1-6.
  9. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the steps of the method according to any one of claims 1-6.
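As a rough illustration of the allocation steps in claims 1 and 6, the following is a minimal Python sketch. The function name, the example power values, and the use of simple proportional rounding are assumptions chosen for illustration; the patent does not prescribe a specific weighting formula.

```python
def allocate(total, powers, unit):
    """Illustrative sketch of the claimed allocation: split `total` data
    units over cores in proportion to their computing power, in integer
    multiples of the granularity `unit` (the "specified number", which the
    claims derive from the cores' memory-access number)."""
    weights = [p / sum(powers) for p in powers]
    # First processing number per core: the weight-proportional share,
    # rounded down to an integer multiple of `unit`.
    first = [int(total * w) // unit * unit for w in weights]
    # Remaining number: what the rounding left over; it is always less
    # than (number of cores) * unit.
    remaining = total - sum(first)
    # Claim-6 step: hand the remainder out in unit-sized (or smaller, for
    # the last chunk) second processing numbers to the highest-weight cores.
    counts = first[:]
    for i in sorted(range(len(powers)), key=lambda i: weights[i], reverse=True):
        if remaining <= 0:
            break
        take = min(unit, remaining)
        counts[i] += take
        remaining -= take
    return counts
```

For example, 100 data units over two cores with relative powers 3 and 1 and a granularity of 8 give first processing numbers 72 and 24; the remaining 4 units go to the higher-weight core, for a final split of 76 and 24.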

Description

Parallel processing method and device

Technical Field

The embodiments of the application relate to the field of computer technology, and in particular to a parallel processing method and device.

Background

Multi-core chips are widely used in various devices to increase their computing power. A multi-core chip comprises a plurality of computing cores that can execute computing tasks in parallel, improving processing efficiency. For example, when a multi-core chip is used to compute data in a convolutional neural network, each computing core may process a portion of the data, so that efficiency is improved through parallel processing. In the related art, when a multi-core chip processes data (such as data in a convolutional neural network) in parallel, the data is distributed equally among the computing cores, i.e., uniformly according to the number of cores, and each core processes its share in parallel. However, when the computing power of the cores differs, the overall performance of the parallel processing is determined by the time taken by the core with the weakest computing power, which lowers the efficiency and performance of parallel processing.

Disclosure of Invention

The embodiments of the application provide a parallel processing method and device that can improve the efficiency and performance of parallel processing across a plurality of computing cores.
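The background's point about the weakest core can be made concrete with a toy timing model; the power values and the linear time = units / power model below are illustrative assumptions, not figures from the patent.

```python
# Toy model: each core needs time = units / power, and the wall-clock
# time of the parallel step is that of the slowest core.
powers = [4.0, 1.0]            # two cores with unequal computing power
wall_time = lambda alloc: max(u / p for u, p in zip(alloc, powers))

equal_split = [50, 50]         # 100 units divided equally by core count
weighted_split = [80, 20]      # 100 units divided in proportion to power

# Equal split: the weak core needs 50 time units while the strong core
# idles after 12.5; the power-proportional split finishes in 20.
assert wall_time(equal_split) == 50.0
assert wall_time(weighted_split) == 20.0
```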
The technical scheme is as follows. In one aspect, a parallel processing method is provided, the method comprising: determining a weight for each computing core of a plurality of computing cores according to the computing power of that core; distributing data to be processed to the plurality of computing cores based on the weights of the cores; and processing the data to be processed in parallel through the plurality of computing cores.

Optionally, distributing the data to be processed to the plurality of computing cores based on the weights comprises: determining the total number of data units included in the data to be processed; determining the number of data units each core is to process based on the weight of each core and the total number; dividing the data to be processed into a plurality of data blocks based on that number, the data blocks being in one-to-one correspondence with the computing cores; and assigning each data block to its corresponding computing core.

Optionally, the data to be processed includes input data of one or more operators in the convolutional neural network, different operators corresponding to different data allocation strategies; and determining the total number of data units included in the data to be processed comprises: determining the total number of data units included in the input data of the one or more operators according to the data allocation strategy corresponding to each operator.
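The per-operator determination of the total number of data units (GEMM convolution, pooling, element-wise) can be sketched as a small dispatch like the following; the operator names and shape keywords are hypothetical, chosen only to mirror the text, and the pooling case shows just the batch-times-channels variant.

```python
def total_units(op, **shape):
    """Illustrative per-operator count of distributable data units."""
    # GEMM convolution: rows of the first matrix (output feature-map size)
    # or columns of the second matrix (output channel count); rows here.
    if op == "gemm":
        return shape["first_matrix_rows"]
    # Pooling: batch count, channel count, or their product.
    if op == "pooling":
        return shape["batches"] * shape["channels"]
    # Element-wise processing: one data unit per element.
    if op == "elementwise":
        return shape["num_elements"]
    raise ValueError(f"unknown operator: {op}")
```

For instance, `total_units("pooling", batches=4, channels=16)` yields 64 data units to distribute among the cores.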
Optionally, the one or more operators comprise a general matrix multiplication (GEMM) convolution operator whose input data comprises a first matrix and a second matrix, and determining the total number of data units according to the data allocation strategy corresponding to each operator comprises: taking the number of rows of the first matrix or the number of columns of the second matrix as the total number, wherein the number of rows of the first matrix indicates the size of the feature map output by the GEMM convolution operator and the number of columns of the second matrix indicates the number of channels of that feature map. Optionally, the one or more operators include a pooling operator whose input data includes a multi-channel feature map to be processed, and determining the total number of data units according to the data allocation strategy corresponding to each operator comprises: determining the batch count of the multi-channel feature map as the total number, or determining the channel count of the multi-channel feature map as the total number