CN-122018996-A - Parallel thread virtualization and SIMT semantic degradation execution method and system

Abstract

The invention provides a parallel thread virtualization and SIMT semantic degradation execution method and system in the technical field of parallel computing. The method comprises: receiving a parallel source program and identifying kernel data; extracting parallel semantics from the kernel data and constructing an intermediate representation; based on the intermediate representation, constructing an active mask that indicates the set of valid logical threads in a stripe by means of a vector-length-independent striping mapping; splitting host-side and device-side program instructions according to the kernel data; applying vector predication to conditional branches so that each logical thread executes independently; intercepting kernel launch calls on the host side through a shim layer, serializing kernel parameters into command descriptors, writing the descriptors into a lock-free ring queue in shared memory, and notifying the device through a doorbell mechanism; and, on the device side, parsing the command descriptors, configuring execution resources, and jumping to the generated kernel code for execution. The invention achieves efficient and transparent degradation of the parallel thread model.

Inventors

  • DAI HONGJUN
  • ZHANG ZHENYU
  • ZHAI MINGJIE
  • LI BING
  • MA YUMING
  • LI HAOYANG

Assignees

  • SHANDONG UNIVERSITY (山东大学)

Dates

Publication Date
2026-05-12
Application Date
2026-04-13

Claims (10)

  1. A parallel thread virtualization and SIMT semantic degradation execution method, comprising: receiving a parallel source program described using the SIMT programming model and identifying kernel data; extracting parallel semantics from the kernel data and constructing an intermediate representation; based on the intermediate representation, constructing an active mask that indicates the set of valid logical threads in a stripe by means of a vector-length-independent striping mapping, and splitting host-side and device-side program instructions according to the kernel data; applying vector predication to conditional branches based on the program instructions so that each logical thread executes independently; generating device-side code comprising vector instructions and matrix acceleration instructions from the intermediate representation; intercepting kernel launch calls on the host side through a shim layer, serializing kernel parameters into command descriptors, writing the command descriptors into a lock-free ring queue in shared memory, and notifying the device through a doorbell mechanism; and parsing the command descriptors by the device-side firmware, configuring execution resources, and jumping to the generated kernel code for execution.
  2. The parallel thread virtualization and SIMT semantic degradation execution method of claim 1, wherein extracting parallel semantics and constructing an intermediate representation comprises: performing scalar evolution and data dependence analysis on the thread index expressions at the compiler front end or middle end, and lifting the thread-level parallelism into an analyzable multidimensional iteration space; and constructing an intermediate representation that contains the grid/thread block/thread three-level structure, in which the private state and shared state of each logical thread are made explicit.
  3. The parallel thread virtualization and SIMT semantic degradation execution method of claim 1, wherein constructing an active mask that indicates the set of valid logical threads in a stripe by means of a vector-length-independent striping mapping specifically comprises: generating, at compile time, a device-side code framework that is independent of the hardware vector length; at run time, querying the hardware vector length on the device or having the host issue it, and computing the number of stripes as S = ⌈G / VL⌉, where S is the number of stripes, ⌈·⌉ is the ceiling (upward rounding) function, G is the logical thread group size, and VL is the hardware vector length; and, for the k-th stripe, constructing an active mask that indicates the set of active logical threads within that stripe, enabling the same binary to execute adaptively on chip models with different hardware vector lengths.
  4. The parallel thread virtualization and SIMT semantic degradation execution method of claim 1, wherein splitting host-side and device-side program instructions according to the kernel data comprises: generating multi-threaded control code for the grid-level/thread-block-level scheduling and task distribution parts on the host side according to the host programming strategy; and converting the numerical computation part of the kernel into device-side vector code by full-function vectorization or region vectorization, such that each vector lane simulates one logical thread and the private scalar variables of the logical threads are lifted into vector registers.
  5. The parallel thread virtualization and SIMT semantic degradation execution method of claim 1, wherein applying vector predication to conditional branches based on the program instructions so that each logical thread executes independently specifically comprises: maintaining a current execution mask M_exec; when a branch condition C is encountered, computing the path masks M_true = M_exec ∧ P and M_false = M_exec ∧ ¬P, where M_true and M_false are the sub-masks for the true and false paths, M_exec is the current execution mask before entering the branch, P is the predicate vector obtained by vectorizing the branch condition, ∧ is bitwise AND, and ¬ is bitwise NOT; and saving/restoring the mask context for nested branches through a mask stack, thereby simulating independent execution of each logical thread without a hardware branch reconvergence unit.
  6. The parallel thread virtualization and SIMT semantic degradation execution method of claim 1, wherein host-side scheduling based on a lock-free ring queue in shared memory specifically comprises: step S501, providing, on the host side, a dynamic link library compatible with the existing runtime ABI; step S502, at host run time, writing a command descriptor into a ring queue in the host-device shared memory region, the ring queue adopting a producer-consumer model in which the host maintains the write pointer and the device maintains the read pointer; step S503, notifying the device through a doorbell mechanism; step S504, on the device side, running minimal firmware or a runtime loop that reads a command descriptor when the doorbell triggers or when polling finds that the host-maintained write pointer differs from the device-maintained read pointer; and step S505, after kernel execution completes, the device writes back a completion flag and a return code and advances the device-maintained read pointer, and the host obtains the completion state, achieving synchronization semantics consistent with the existing heterogeneous runtime.
  7. A parallel thread virtualization and SIMT semantic degradation execution system, comprising: a data acquisition module configured to receive a parallel source program described using the SIMT programming model and identify kernel data; an instruction mapping module configured to extract parallel semantics from the kernel data and construct an intermediate representation, construct an active mask that indicates the set of valid logical threads in a stripe by means of a vector-length-independent striping mapping based on the intermediate representation, and split host-side and device-side program instructions according to the kernel data; a predicate processing module configured to apply vector predication to conditional branches based on the program instructions so that each logical thread executes independently; a vector and matrix execution module configured to generate device-side code comprising vector instructions and matrix acceleration instructions from the intermediate representation; and a dispatch and execution module configured to intercept kernel launch calls on the host side through a shim layer, serialize kernel parameters into command descriptors, write the command descriptors into a lock-free ring queue in shared memory, notify the device through a doorbell mechanism, and have the device-side components parse the command descriptors, configure execution resources, and jump to the generated kernel code for execution.
  8. An electronic device comprising a memory, a processor, and computer instructions stored in the memory and runnable on the processor, wherein the computer instructions, when executed by the processor, perform the parallel thread virtualization and SIMT semantic degradation execution method of any one of claims 1-6.
  9. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the parallel thread virtualization and SIMT semantic degradation execution method of any one of claims 1-6.
  10. A computer program product comprising executable instructions stored on a computer-readable storage medium, wherein the parallel thread virtualization and SIMT semantic degradation execution method of any one of claims 1-6 is implemented when a processor of an electronic device reads the executable instructions from the computer-readable storage medium and executes them.
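
As an illustrative sketch of the host-side shim layer and command-descriptor serialization in claim 1 (the entry-point name `shim_launch`, the descriptor layout, and the 64-byte parameter area are assumptions of ours, not taken from the patent):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t kernel_addr;   /* device entry point of the generated kernel */
    uint32_t grid, block;   /* launch geometry */
    uint8_t  args[64];      /* serialized kernel parameters */
    uint32_t args_len;
} cmd_desc_t;

/* Shim entry point: instead of issuing a hardware launch, pack the call
   into a command descriptor destined for the shared-memory queue. */
static int shim_launch(uint64_t kernel_addr, uint32_t grid, uint32_t block,
                       const void *const *params, const uint32_t *sizes,
                       int nparams, cmd_desc_t *out) {
    out->kernel_addr = kernel_addr;
    out->grid  = grid;
    out->block = block;
    out->args_len = 0;
    for (int i = 0; i < nparams; ++i) {
        if (out->args_len + sizes[i] > sizeof out->args)
            return -1;                       /* parameters do not fit */
        memcpy(out->args + out->args_len, params[i], sizes[i]);
        out->args_len += sizes[i];
    }
    return 0;
}
```

In a real shim the function would export the same symbol as the existing runtime's launch API so that unmodified binaries link against it transparently.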
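
The stripe count S = ⌈G / VL⌉ and the per-stripe active mask of claim 3 can be sketched as a minimal C model (the function names are ours, and the mask is modeled as a 64-bit word, i.e. VL ≤ 64 is assumed):

```c
#include <assert.h>
#include <stdint.h>

/* Number of stripes needed to cover a logical thread group of size
   group_size on hardware with vector length vlen: ceil(G / VL). */
static uint32_t stripe_count(uint32_t group_size, uint32_t vlen) {
    return (group_size + vlen - 1) / vlen;   /* upward rounding */
}

/* Active mask for stripe k: bit i is set iff lane i maps to a real
   logical thread, i.e. k*vlen + i < group_size. Only the final stripe
   can be partially active. */
static uint64_t active_mask(uint32_t group_size, uint32_t vlen, uint32_t k) {
    uint64_t mask = 0;
    for (uint32_t i = 0; i < vlen; ++i)
        if (k * vlen + i < group_size)
            mask |= 1ull << i;
    return mask;
}
```

Because vlen is a run-time input, the same code covers a group of 10 threads in 3 stripes on VL = 4 hardware or in a single partially masked stripe on VL = 16 hardware, which is the vector-length independence the claim describes.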
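
Claim 4's lifting of a private scalar into a vector register can be illustrated with a small C model in which a plain array stands in for a vector register and a loop over lanes stands in for one vector instruction (the kernel body and the `VLEN` value are assumed for illustration only):

```c
#include <assert.h>
#include <stdint.h>

#define VLEN 4   /* assumed hardware vector length */

/* SIMT-style scalar kernel body, written per logical thread:
       out[tid] = in[tid] + tid;
   After full-function vectorization, the private scalar `tid` is
   lifted into a vector register (tid_vec), and each vector lane
   simulates one logical thread. */
static void kernel_vectorized(const int32_t *in, int32_t *out, int base) {
    int32_t tid_vec[VLEN];                 /* lifted private scalar */
    for (int lane = 0; lane < VLEN; ++lane)
        tid_vec[lane] = base + lane;       /* lane -> logical thread id */
    for (int lane = 0; lane < VLEN; ++lane)
        out[tid_vec[lane]] = in[tid_vec[lane]] + tid_vec[lane];
}
```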
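
The mask computation of claim 5 can be sketched as follows (a minimal model assuming 64-bit masks and a fixed-depth mask stack; all names are ours):

```c
#include <assert.h>
#include <stdint.h>

/* Split the current execution mask M_exec on a vectorized branch
   predicate P: the true path runs under M_exec & P, the false path
   under M_exec & ~P. */
static void branch_masks(uint64_t exec, uint64_t pred,
                         uint64_t *m_true, uint64_t *m_false) {
    *m_true  = exec & pred;    /* lanes taking the true path  */
    *m_false = exec & ~pred;   /* lanes taking the false path */
}

/* Mask stack for nested branches: save the execution mask on entry,
   restore it at the reconvergence point. */
typedef struct { uint64_t stack[16]; int top; } mask_stack_t;

static void     mask_push(mask_stack_t *s, uint64_t m) { s->stack[s->top++] = m; }
static uint64_t mask_pop(mask_stack_t *s)              { return s->stack[--s->top]; }
```

Both paths of a branch are then executed unconditionally under their respective masks, which is what lets the scheme dispense with a hardware branch reconvergence unit.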
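
Steps S501-S505 of claim 6 can be sketched as a producer-consumer ring in C (the descriptor fields, slot count, and doorbell representation are illustrative assumptions; a real implementation would also need memory barriers between filling a slot and advancing the write pointer):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RING_SLOTS 8   /* power of two so indices wrap cheaply */

typedef struct {
    uint64_t kernel_addr;   /* device entry point of generated kernel */
    uint64_t args_addr;     /* serialized kernel parameters           */
    uint32_t grid, block;   /* launch geometry                        */
} cmd_desc_t;

typedef struct {
    cmd_desc_t slot[RING_SLOTS];
    volatile uint32_t wr;        /* host-maintained write pointer  */
    volatile uint32_t rd;        /* device-maintained read pointer */
    volatile uint32_t doorbell;
} ring_t;

/* Host side (S502/S503): enqueue a descriptor, then ring the doorbell. */
static int host_submit(ring_t *r, const cmd_desc_t *d) {
    if (r->wr - r->rd == RING_SLOTS) return -1;   /* queue full */
    r->slot[r->wr % RING_SLOTS] = *d;
    r->wr++;                                      /* publish the slot */
    r->doorbell = 1;                              /* notify the device */
    return 0;
}

/* Device side (S504/S505): consume one descriptor when wr != rd,
   then advance the device-maintained read pointer. */
static int device_poll(ring_t *r, cmd_desc_t *out) {
    if (r->rd == r->wr) return 0;                 /* nothing pending */
    *out = r->slot[r->rd % RING_SLOTS];
    r->rd++;
    return 1;
}
```

Because the host only ever writes `wr` and the device only ever writes `rd`, a single-producer/single-consumer queue of this shape needs no locks, matching the lock-free design of the claim.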

Description

Parallel thread virtualization and SIMT semantic degradation execution method and system

Technical Field

The invention belongs to the technical field of parallel computing, and particularly relates to a parallel thread virtualization and SIMT semantic degradation execution method and system.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art. With the rapidly increasing demand for computing power from workloads such as artificial intelligence training and inference, big data analysis, and scientific computing, heterogeneous computing architectures (host CPU cooperating with accelerators) have become mainstream. Existing general-purpose graphics processors are typically equipped with hardware-level thread bundle (warp) scheduling, branch divergence handling, and multi-level memory systems, and are deeply bound to a mature software ecosystem, forming a complete software loop from programming model to compiler, runtime, and math libraries. Meanwhile, novel parallel computing chips optimized for energy efficiency and area increasingly adopt domain-specific custom instruction sets and integrate hardware structures such as scalable vector computing units and matrix multiplication acceleration units; however, they usually improve computing density by simplifying hardware control logic, and therefore lack the hardware thread bundle scheduler and branch reconvergence unit of a traditional GPU. These architectural differences make it difficult for the existing stock of single-instruction multiple-thread (SIMT) ecosystem programs to run directly on such chips; users must rewrite them or manually port them to specific assembly or non-mainstream languages, incurring high migration costs and rarely reaching native optimization levels.
In existing solutions, although PTX-level dynamic translation can achieve a degree of compatibility, it generally suffers from high simulation overhead and difficulty in fully exploiting custom vector/matrix hardware features.

Disclosure of Invention

To overcome the defects of the prior art, the invention provides a parallel thread virtualization and SIMT semantic degradation execution method and system oriented to custom-instruction-set parallel computing chips. On a custom-instruction-set parallel computing chip lacking a traditional GPU hardware thread bundle scheduler, it achieves transparent compatibility with the SIMT ecosystem in the absence of a hardware warp scheduler and maximizes utilization of the underlying vector/matrix hardware computing power, by means of a "logical thread group" abstraction, vector-length-independent modeling, masked full-path execution of control flow, and host orchestration based on a lock-free ring queue. The invention can efficiently and transparently map the SIMT parallel thread model, in a degraded form, onto a custom SIMD/matrix instruction set with no or minimal changes to upper-layer source code, and additionally provides a technical scheme for low-latency host-device cooperative operation.
To achieve the above object, one or more embodiments of the present invention provide the following technical solutions. In a first aspect, the invention discloses a parallel thread virtualization and SIMT semantic degradation execution method, including: receiving a parallel source program described using the SIMT programming model and identifying kernel data; extracting parallel semantics from the kernel data and constructing an intermediate representation; based on the intermediate representation, constructing an active mask that indicates the set of valid logical threads in a stripe by means of a vector-length-independent striping mapping, and splitting host-side and device-side program instructions according to the kernel data; applying vector predication to conditional branches based on the program instructions so that each logical thread executes independently; generating device-side code comprising vector instructions and matrix acceleration instructions from the intermediate representation; intercepting kernel launch calls on the host side through a shim layer, serializing kernel parameters into command descriptors, writing the command descriptors into a lock-free ring queue in shared memory, and notifying the device through a doorbell mechanism; and parsing the command descriptors by the device-side firmware, configuring execution resources, and jumping to the generated kernel code for execution. In a second aspect, the invention discloses a parallel thread virtualization and SIMT semantic degradation execution system, comprising: a data acquisition module configured to receive a parallel source program described using the SIMT programming model