
CN-114546643-B - ARM architecture-oriented NUMA (non-uniform memory access) aware parallel computing method and system

CN114546643B

Abstract

The invention discloses a NUMA-aware parallel computing method and system oriented to the ARM architecture. The method comprises: obtaining computing tasks and dividing them by type to obtain divided computing tasks; determining the size of an execution space and selecting an execution strategy; dividing the execution space according to NUMA architecture information and assigning the divided execution subspaces to CPU cores; and executing the computing tasks and reducing the computation results to obtain a final result. The system comprises a type-division module, a setting module, an allocation module, and a reduction module. By pre-allocating the execution space and redistributing computing resources, the invention assigns parallel task threads to the CPU cores of different NUMA nodes in a reasonable way, thereby effectively improving performance. The ARM-architecture-oriented NUMA-aware parallel computing method and system can be widely applied in the field of computer parallel computing.

Inventors

  • FAN YAOZHONG
  • HUANG DAN
  • CHEN ZHIGUANG
  • LU YUTONG

Assignees

  • Sun Yat-sen University (中山大学)

Dates

Publication Date
2026-05-12
Application Date
2022-02-17

Claims (5)

  1. A NUMA-aware parallel computing method oriented to the ARM architecture, characterized by comprising the following steps: obtaining a computing task and dividing it by type to obtain divided computing tasks; determining the size of an execution space and selecting an execution strategy; dividing the execution space according to NUMA architecture information, and assigning the divided execution subspaces to CPU cores; and executing the computing task and reducing the computation results to obtain a final result; wherein the execution strategies comprise a one-dimensional space execution strategy, a multidimensional space execution strategy, and a grouping execution strategy; the step of dividing the execution space according to NUMA architecture information and assigning the divided execution subspaces to CPU cores specifically comprises: acquiring the NUMA architecture information of the current device, the NUMA architecture information comprising the number of NUMA nodes and the number of CPU cores contained in each NUMA node; calculating allocation coefficients according to the NUMA architecture information of the device; and, based on the allocation coefficients, sequentially assigning to each NUMA node its number of execution subspaces; and wherein, when the one-dimensional space execution strategy is selected, dividing the execution space according to NUMA architecture information specifically comprises: uniformly dividing a linear execution interval according to the number of CPU cores available on the device to obtain divided execution subintervals, the divided execution subintervals being of equal length.
  2. The ARM-architecture-oriented NUMA-aware parallel computing method according to claim 1, wherein the step of executing the computing task and reducing the computation results to obtain the final result specifically comprises: executing the computing task based on the execution subintervals; within each execution subinterval, using serial reduction, outputting the reduction result of each computing task and storing it in the corresponding memory address space, thereby completing the reduction among the CPU cores within a NUMA node; and performing result reduction among the NUMA nodes to obtain the final result.
  3. The ARM-architecture-oriented NUMA-aware parallel computing method according to claim 1, wherein, when the multidimensional space execution strategy is selected, dividing the execution space according to NUMA architecture information specifically comprises: grouping every execution space dimension except the last according to a preset unit length, the last dimension being left undivided; and combining the groups of each execution space dimension to obtain the divided execution subspaces.
  4. The ARM-architecture-oriented NUMA-aware parallel computing method according to claim 1, wherein, when the grouping execution strategy is selected, dividing the execution space according to NUMA architecture information specifically comprises: grouping the CPU cores to obtain a number of groups; and dividing the first dimension according to the number of groups and the second dimension according to the number of cores in each group to obtain the divided execution subspaces.
  5. A NUMA-aware parallel computing system oriented to the ARM architecture, for performing the ARM-architecture-oriented NUMA-aware parallel computing method of claim 1, comprising: a type-division module for obtaining computing tasks and dividing them by type to obtain divided computing tasks; a setting module for determining the size of the execution space and selecting an execution strategy; an allocation module for dividing the execution space according to NUMA architecture information and assigning the divided execution subspaces to CPU cores; and a reduction module for executing the computing tasks and reducing the computation results to obtain a final result.
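For illustration only, the one-dimensional partition and sequential node assignment described in claim 1 could be sketched roughly as follows. All function names are hypothetical, and the even-divisibility assumption (the interval length is a multiple of the core count, and each node contributes the same number of cores) is ours, not the patent's.

```python
# Illustrative sketch only -- names and the even-divisibility assumption are
# hypothetical, not taken from the patent text.

def partition_1d(total, n_cores):
    """Uniformly divide the linear execution interval [0, total) into
    n_cores equal-length execution subintervals (one per CPU core)."""
    step = total // n_cores  # assumes total is a multiple of n_cores
    return [(i * step, (i + 1) * step) for i in range(n_cores)]

def assign_to_nodes(subintervals, n_nodes, cores_per_node):
    """Sequentially assign subintervals to NUMA nodes: here the allocation
    coefficient is simply the number of cores per node, so each node
    receives one contiguous run of cores_per_node subintervals."""
    coeff = cores_per_node
    return {node: subintervals[node * coeff:(node + 1) * coeff]
            for node in range(n_nodes)}

# Example: a device with 2 NUMA nodes x 4 cores and an interval of length 1000.
subs = partition_1d(1000, 8)  # 8 subintervals of length 125
mapping = assign_to_nodes(subs, n_nodes=2, cores_per_node=4)
```

Keeping each node's subintervals contiguous matches the claim's goal: a core's working range stays in memory local to its own NUMA node.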

Description

ARM architecture-oriented NUMA (non-uniform memory access) aware parallel computing method and system

Technical Field

The invention relates to the field of computer parallel computing, and in particular to a NUMA-aware parallel computing method and system oriented to the ARM architecture.

Background

The ARM architecture is a processor architecture built on the ARM instruction set and developed for a range of application domains. Unlike the Complex Instruction Set Computing (CISC) design of the x86 architecture, ARM is a Reduced Instruction Set Computing (RISC) architecture, which reduces the number of rarely used instructions and the complexity of the chip. ARM uses a load/store architecture: only load and store instructions may access memory, while all other data-processing instructions operate only on registers. ARM processors are therefore characterized by low power consumption, low cost, and high performance.

In recent years, the number of transistors that can be integrated on a single chip has approached a bottleneck. To keep improving processor performance, processor design has gradually shifted from raising clock frequencies to increasing the number of cores, giving rise to multi-core processors. Many current multi-core processors adopt a NUMA (non-uniform memory access) architecture. Unlike the conventional uniform memory access architecture, memory in a NUMA system is physically distributed: different CPU cores belong to different NUMA nodes, each node has its own integrated memory controller, and a processor accesses the memory of its own NUMA node faster than the memory of other nodes.

A parallel program often contains multiple parallel computing tasks. These tasks may be independent, in which case the result of each task is unaffected by the others.
In other scenarios there are dependencies between parallel tasks, which require data exchange between threads. Because a processor's memory access performance differs across NUMA nodes, processing dependent parallel computing tasks can incur a large performance overhead when many memory accesses cross NUMA nodes.

Disclosure of the Invention

To solve the above technical problem, the invention provides a NUMA-aware parallel computing method and system oriented to the ARM architecture, which applies different parallel data-processing procedures to different types of computing tasks.

The first technical scheme adopted by the invention is a NUMA-aware parallel computing method oriented to the ARM architecture, comprising the following steps: obtaining a computing task and dividing it by type to obtain divided computing tasks; determining the size of an execution space and selecting an execution strategy; dividing the execution space according to NUMA architecture information and assigning the divided execution subspaces to CPU cores; and executing the computing task and reducing the computation results to obtain a final result.

Preferably, the task type of the divided computing task is a parallel for loop, and the execution strategies comprise a one-dimensional space execution strategy, a multidimensional space execution strategy, and a grouping execution strategy.
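The two-level reduction step described in the claims (each core serially reduces its own subinterval, per-node partial results are stored, and the partials are then reduced across NUMA nodes) might be sketched, with hypothetical names, as follows; the patent does not prescribe this code.

```python
# Hypothetical sketch of the two-level reduction: per-core serial reduction
# within each NUMA node, then reduction of the per-node partials across nodes.

def reduce_core(subinterval, op, init):
    """Serially reduce one core's execution subinterval [lo, hi)."""
    lo, hi = subinterval
    acc = init
    for x in range(lo, hi):
        acc = op(acc, x)
    return acc

def numa_reduce(node_map, op, init):
    """node_map: {numa_node: [subintervals assigned to that node's cores]}."""
    # Stage 1: within each node, serially combine that node's per-core results
    # (modeling the reduction among CPU cores inside a NUMA node).
    node_partials = []
    for node, subintervals in sorted(node_map.items()):
        partial = init
        for sub in subintervals:
            partial = op(partial, reduce_core(sub, op, init))
        node_partials.append(partial)
    # Stage 2: reduce the per-node partial results across NUMA nodes.
    total = init
    for partial in node_partials:
        total = op(total, partial)
    return total

# Example: sum 0..999 on 2 NUMA nodes with 2 subintervals each.
node_map = {0: [(0, 250), (250, 500)], 1: [(500, 750), (750, 1000)]}
result = numa_reduce(node_map, lambda a, b: a + b, 0)  # 499500
```

Doing stage 1 locally first means the expensive cross-node traffic is limited to one partial result per NUMA node, which is the point of the hierarchy.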
Preferably, the step of dividing the execution space according to NUMA architecture information and assigning the divided execution subspaces to CPU cores specifically comprises: acquiring the NUMA architecture information of the current device, the NUMA architecture information comprising the number of NUMA nodes and the number of CPU cores contained in each NUMA node; calculating allocation coefficients according to the NUMA architecture information of the device; and, based on the allocation coefficients, sequentially assigning to each NUMA node its number of execution subspaces.

Preferably, the step of executing the computing task and reducing the computation results to obtain a final result specifically comprises: executing the computing task based on the execution subintervals; within each execution subinterval, using serial reduction, outputting the reduction result of each computing task and storing it in the corresponding memory address space, thereby completing the reduction among the CPU cores within a NUMA node; and performing result reduction among the NUMA nodes to obtain the final result.

Preferably, when the one-dimensional space execution strategy is selected, dividing the execution space according to NUMA architecture information specifically comprises: uniformly dividing a linear execution interval according to the number of CPU cores available on the device to obtain divided execution subintervals, the divided execution subintervals being of equal length.

Preferably,