CN-122018914-A - CUDA source program compiling method and related equipment based on general compiler


Abstract

The application provides a CUDA source program compiling method based on a general-purpose compiler, and related devices. The method comprises: performing a preprocessing operation on the source code of the CUDA source program to generate a first intermediate file and a second intermediate file; scanning the host-side code in the first intermediate file to identify a target function identifier representing device-side execution semantics, and performing a reconstruction operation on each function containing the target function identifier to generate proxy functions for triggering device-side execution; scanning the device-side code in the second intermediate file to identify the target function identifier, and performing semantic removal processing on it to generate device-side executable function code; generating a host-side target file and a device-side target file, and further generating a device-side dynamic link file; and merging the device-side dynamic link file with the host-side target file to form the target file. With this scheme, a CUDA program can be converted into a target program that runs on a CPU, which reduces the cost of building the runtime environment.

Inventors

SHI ZHIXING
ZHAO FEI

Assignees

上海思朗科技股份有限公司

Dates

Publication Date: 20260512
Application Date: 20260114

Claims (10)

1. A CUDA source program compiling method based on a general-purpose compiler, comprising: performing a preprocessing operation on the source code of a CUDA source program to generate a first intermediate file and a second intermediate file respectively, wherein the preprocessing operation at least comprises expanding macro-definition-related directives; in a host-side code processing path, scanning the host-side code in the first intermediate file to identify a target function identifier representing device-side execution semantics, and performing a reconstruction operation on a function containing the target function identifier to generate a proxy function for triggering device-side execution; in a device-side code processing path, scanning the device-side code in the second intermediate file to identify the target function identifier, and performing semantic removal processing on the target function identifier to generate device-side executable function code; generating a host-side target file based on the processed first intermediate file, generating a device-side target file based on the processed second intermediate file, and further generating a device-side dynamic link file; and merging the device-side dynamic link file and the host-side target file to form a target file.
2. The compiling method of claim 1, wherein performing the preprocessing operation on the source code of the CUDA source program to generate the first intermediate file and the second intermediate file respectively comprises: using a g++ compiler to perform a processing operation comprising a preprocessing stage on a source code file with the .cu suffix, expanding the macro definitions, conditional compilation directives and header file include directives in the source code file by invoking the preprocessing option, and outputting the macro-expanded source code to a designated target file in the form of an intermediate file; wherein the first intermediate file is generated when the -Ea.cu-o.a.cu.E command is employed, and the second intermediate file is generated when the -Ea.cu-o.c.E.dev command is employed.
3. The compiling method according to claim 1, wherein the target function identifier comprises the __global__ keyword; and the reconstruction of a function containing the target function identifier comprises: taking the code from immediately after the __global__ keyword up to the first left curly brace { as the function head, taking the code between that left curly brace and its matching right curly brace } as the function body, and taking the content between the parentheses () in the function head as the parameter list, so that the function name, the parameter list and the function body are extracted.
4. The compiling method according to claim 3, wherein, when generating the proxy function for triggering device-side execution, a first function, a second function and a third function are generated at the original position of the function in which the __global__ keyword is located; the first function is a function fn_get_mangled used for sending the parameters and an execution request to the device side; the second function is a function fn that is passed as a parameter when cudaLaunchKernel is called; the third function is a function fn__stub_kernelaunch__ used for replacing the kernel launch syntax <<<>>>; the second function calls the first function, and the third function calls the second function.
5. The compiling method according to claim 4, wherein generating the first function comprises: using fn_get_mangled as the function name and the parameter list as the formal parameters; and storing, in a thread-local variable funcBack, the name of the .so file to be executed on the device side, the name of the function to be executed, and the number of parameters, wherein the thread-local variable funcBack is used when cudaLaunchKernel is called; the second function takes a void* type parameter, assigns the void* type parameter to the thread-local variable funcBack, takes the parameter list as local variables, and calls fn_get_mangled; and the parameter list of the third function comprises the parameter list together with dim3 gridDim, dim3 blockDim, size_t sm_size and cudaStream_t stream, and the third function constructs a parameter pointer array and calls cudaLaunchKernel((void*)fn, gridDim, blockDim, args, sm_size, stream).
6. The compiling method according to any one of claims 3 to 5, further comprising, after completing the reconstruction of the call to the function in which the __global__ keyword is located: replacing the __global__ keyword with an empty string; for a function modified only by __host__, replacing the __host__ modifier with an empty string; for a function modified only by __device__, replacing the whole function with an empty string; and for a function modified by both __host__ and __device__, replacing the __host__ and __device__ modifiers with empty strings and preserving the function implementation.
7. The compiling method according to claim 1, wherein, when generating the device-side executable function code, global objects are processed as follows: for a global variable modified by __device__, replacing the __device__ modifier with __attribute__((section(".DEVICE"))); and for a global variable modified by __constant__, replacing the __constant__ modifier with __attribute__((section(".CONST"))); wherein the type of object modified by __device__ is determined by checking whether a parameter list appears after __device__ and before the first left curly brace or semicolon: if so, __device__ modifies a function, and otherwise it modifies a variable.
8. The compiling method of claim 1, wherein merging the device-side dynamic link file and the host-side target file to form the target file comprises: appending the binary content of the device-side dynamic link file to the end of the host-side target file; appending, after the content of the device-side dynamic link file, a fixed-length identification string that indicates whether device-side code is present; and appending, after the fixed-length identification string, a fixed-length description field that describes the byte length of the device-side dynamic link file, thereby forming a target file structure that can be parsed and split at run time.
9. A CUDA source program compiling apparatus based on a general-purpose compiler, comprising: a preprocessing module configured to perform a preprocessing operation on the source code of a CUDA source program to generate a first intermediate file and a second intermediate file respectively, wherein the preprocessing operation at least comprises expanding macro-definition-related directives; a processing module configured to scan, in a host-side code processing path, the host-side code in the first intermediate file to identify a target function identifier representing device-side execution semantics, and to perform a reconstruction operation on a function containing the target function identifier to generate a proxy function for triggering device-side execution, and further configured to scan, in a device-side code processing path, the device-side code in the second intermediate file to identify the target function identifier, and to perform semantic removal processing on the target function identifier to generate device-side executable function code; and a merging module configured to generate a host-side target file based on the processed first intermediate file, generate a device-side target file based on the processed second intermediate file, further generate a device-side dynamic link file, and merge the device-side dynamic link file and the host-side target file to form a target file.
10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 8 when executing the computer program; and/or a computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 8; and/or a computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method of any one of claims 1 to 8.
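
The file layout recited in claim 8 can be illustrated with a short sketch. This is only one possible reading: the claims say the identification string and the length field are fixed-length but give neither their contents nor their widths, so the marker bytes `MARKER` and the 8-byte little-endian length field below are assumptions, as are the function names `merge` and `split`.

```python
import struct

# Assumed values: the claims only say "fixed-length". The marker bytes and
# the 8-byte little-endian length field are illustrative choices.
MARKER = b"__DEVICE_SO_PRESENT__"   # hypothetical identification string
LENGTH_FMT = "<Q"                   # hypothetical 8-byte length field
TRAILER_SIZE = len(MARKER) + struct.calcsize(LENGTH_FMT)


def merge(host_obj: bytes, device_so: bytes) -> bytes:
    """Append the device-side library, then the marker, then its byte length."""
    return host_obj + device_so + MARKER + struct.pack(LENGTH_FMT, len(device_so))


def split(target: bytes):
    """Return (host_obj, device_so); device_so is None when no marker is found."""
    if len(target) < TRAILER_SIZE:
        return target, None
    trailer = target[-TRAILER_SIZE:]
    if trailer[:len(MARKER)] != MARKER:
        return target, None
    # The length field sits after the marker, at the very end of the file.
    (so_len,) = struct.unpack(LENGTH_FMT, trailer[len(MARKER):])
    so_end = len(target) - TRAILER_SIZE
    so_start = so_end - so_len
    if so_start < 0:
        raise ValueError("length field exceeds file size")
    return target[:so_start], target[so_start:so_end]
```

Because both trailer fields have a fixed size and sit at the end of the file, a runtime loader following this sketch could locate the device-side library by reading backwards from the end of the file, without parsing the host-side object format.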

Description

CUDA source program compiling method and related equipment based on general compiler
Technical Field
The application relates to the field of computer technology, and in particular to a CUDA source program compiling method based on a general-purpose compiler and related devices.
Background
During algorithm development, a runtime environment must be built to match the development requirements; that is, graphics processing units (GPUs), a dedicated high-speed interconnect (NVSwitch) and remote direct memory access (RDMA) must be connected according to the requirements at hand. Building the environment from real hardware takes a long time and is costly, whereas using a central processing unit (CPU) to simulate GPU operation allows an experimental environment to be set up quickly. However, to simulate GPU operation on a CPU, the CUDA source program must be compiled into a target file, and the NVCC compiler (NVIDIA CUDA Compiler) can only compile a CUDA source program into a GPU-side target file, which cannot be executed on a CPU.
The above information disclosed in this Background section is intended only to enhance understanding of the background of the application, and it may therefore contain information that does not constitute prior art already known to a person of ordinary skill in the art.
Disclosure of Invention
The embodiments of the application provide a CUDA source program compiling method based on a general-purpose compiler and related devices, which can convert a CUDA program into a target program capable of running on a CPU, thereby reducing the cost of building the runtime environment and shortening the time needed to build it.
In a first aspect of the embodiments of the present application, there is provided a CUDA source program compiling method based on a general-purpose compiler, comprising: performing a preprocessing operation on the source code of a CUDA source program to generate a first intermediate file and a second intermediate file respectively, wherein the preprocessing operation at least comprises expanding macro-definition-related directives; in a host-side code processing path, scanning the host-side code in the first intermediate file to identify a target function identifier representing device-side execution semantics, and performing a reconstruction operation on a function containing the target function identifier to generate a proxy function for triggering device-side execution; in a device-side code processing path, scanning the device-side code in the second intermediate file to identify the target function identifier, and performing semantic removal processing on the target function identifier to generate device-side executable function code; generating a host-side target file based on the processed first intermediate file, generating a device-side target file based on the processed second intermediate file, and further generating a device-side dynamic link file; and merging the device-side dynamic link file and the host-side target file to form a target file.
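
As a rough illustration of the host-side scan described above, the following sketch locates each `__global__` function in preprocessed source text and splits it into a name, a parameter list and a body by brace matching. The function names `find_matching_brace` and `extract_kernels` are hypothetical, and the sketch makes simplifying assumptions that the application does not state: braces inside string literals are not handled, and a `__global__` declaration without a body is simply skipped.

```python
import re


def find_matching_brace(text: str, open_idx: int) -> int:
    """Return the index of the '}' that closes the '{' at open_idx."""
    depth = 0
    for i in range(open_idx, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unbalanced braces")


def extract_kernels(source: str):
    """Collect name, parameter list and body of every __global__ function."""
    kernels = []
    for m in re.finditer(r"\b__global__\b", source):
        open_idx = source.find("{", m.end())
        if open_idx == -1:
            continue
        # Function head: everything after __global__ up to the first '{'.
        head = source[m.end():open_idx]
        if ";" in head:
            continue  # a declaration only; the '{' belongs to later code
        lparen, rparen = head.find("("), head.rfind(")")
        tokens = head[:lparen].split() if lparen != -1 else []
        if not tokens or rparen < lparen:
            continue
        close_idx = find_matching_brace(source, open_idx)
        kernels.append({
            "name": tokens[-1],
            "params": head[lparen + 1:rparen].strip(),
            "body": source[open_idx + 1:close_idx],
        })
    return kernels
```

For example, applied to a file containing `__global__ void add(float *a, int n) { ... }`, the sketch yields one entry whose name is `add` and whose parameter list is `float *a, int n`; a reconstruction step could then emit proxy functions from these three pieces.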
In an optional embodiment of the present application, performing the preprocessing operation on the source code of the CUDA source program to generate the first intermediate file and the second intermediate file respectively comprises: using a g++ compiler to perform a processing operation comprising a preprocessing stage on a source code file with the .cu suffix, expanding the macro definitions, conditional compilation directives and header file include directives in the source code file by invoking the preprocessing option, and outputting the macro-expanded source code to a designated target file in the form of an intermediate file; wherein the first intermediate file is generated when the -Ea.cu-o.a.cu.E command is employed, and the second intermediate file is generated when the -Ea.cu-o.c.E.dev command is employed.
In an optional embodiment of the present application, the target function identifier comprises the __global__ keyword; and the reconstruction of a function containing the target function identifier comprises: taking the code from immediately after the __global__ keyword up to the first left curly brace { as the function head, taking the code between that left curly brace and its matching right curly brace } as the function body, and taking the content between the parentheses () in the function head as the parameter list, so that the function name, the parameter list and the function body are extracted.
In an optional embodiment of the present application, when the proxy function for triggering device-side execution is generated, a first function, a second function and a third function are generated at the original position of the function in which the __global__ keyword is located; The first function is used for sending parameters and a funct