CN-122021503-A - Graph processor modeling and analyzing method and device


Abstract

The invention discloses a method and a device for modeling and analyzing a graph processor, and belongs to the field of auxiliary design of computer architecture. The method comprises the steps of establishing a unified calculation model decoupled from a storage architecture model, obtaining full-link micro-architecture behavior data through FPGA actual measurement, constructing a bandwidth constraint model to determine optimal parallelism of a calculation unit, performing closed-loop calibration on bus delay and internal logic parameters of a simulation model by utilizing actual measurement data, and performing space evaluation and performance quantitative analysis on multiple storage architecture schemes on the calibrated reference model while keeping the optimal parallelism unchanged. The device comprises an upper computer, an FPGA hardware platform and a communication interface component, wherein the upper computer is integrated with system-level simulation software and hardware control software. The invention solves the problems that the parallelism is difficult to reasonably determine and the simulation precision of pure software is insufficient by constructing a software and hardware closed loop cooperative mechanism, and realizes high-precision and high-efficiency evaluation of a graph processor computing subsystem and a storage subsystem.

Inventors

YU LE
CHANG XIAO

Assignees

北京工商大学

Dates

Publication Date: 20260512
Application Date: 20260305

Claims (10)

1. A graph processor modeling and analysis apparatus, the apparatus comprising: an upper computer configured to run system-level simulation software and hardware control software and to provide a unified modeling environment and test control interface; an FPGA hardware platform configured to deploy a hardware register transfer level (RTL) model of the graph processor, the model generating real memory access interaction signals when running; and a communication interface component connected to both the upper computer and the FPGA hardware platform and used for transmitting control instructions and sampling data between them; wherein the upper computer comprises: a simulation modeling module for constructing a unified calculation model whose logic structure is consistent with that of the hardware RTL model, and a storage architecture model that can be configured parametrically; a hardware actual measurement control module for driving the FPGA hardware platform, through the communication interface component, to run a test task and for collecting the micro-architecture behavior data generated by the FPGA hardware platform during the run; and a data interaction interface module connected between the simulation modeling module and the hardware actual measurement control module, for formatting the micro-architecture behavior data and injecting it into the simulation modeling module so as to establish a data closed loop between the simulation environment and the hardware actual measurement environment.
2. The device of claim 1, wherein the upper computer is connected with the FPGA hardware platform through the communication interface component; a data interaction interface module, a simulation modeling module and a hardware actual measurement control module are configured in the upper computer; the simulation modeling module and the hardware actual measurement control module are connected to the two sides of the data interaction interface module respectively; the hardware actual measurement control module is used for driving the FPGA hardware platform to run a test task; and the data interaction interface module is used for transmitting data between the simulation modeling module and the hardware actual measurement control module.
3. A method of modeling and analyzing a graph processor, the method comprising the steps of: establishing a physical connection between the upper computer and the FPGA hardware platform through the communication interface component, to complete establishment of a communication link and system initialization; constructing a unified calculation model of the graph processor with the simulation modeling module, abstractly describing the calculation unit structure, task scheduling and parallel execution logic, and decoupling the model from the storage architecture model through a standardized interface, wherein at this stage the unified calculation model contains no interconnection delay or buffer timing parameters based on hardware actual measurement data; using the hardware actual measurement control module to drive the FPGA hardware platform to run a calculation task, and collecting the calculation characteristics and memory access behavior characteristics of the graph processor under real hardware conditions to generate actual measurement data for subsequent model constraint and calibration; extracting, from the actual measurement data, the actual calculation parameters and per-unit memory access intensity parameters of the calculation unit, and injecting these parameters into the unified calculation model; establishing an initial constraint model in an intermediate state in which the calculation unit parameters of the unified calculation model have been constrained by actual measurement but the interconnection delay and buffer timing parameters have not yet been calibrated against the actual measurement data; after the optimal parallelism is determined, further using the access delay, buffer blocking and timing behavior parameters reflected in the actual measurement data to calibrate the interconnection delay, buffer depth and timing characteristics of the unified calculation model, generating a calibrated reference simulation model; and, while keeping the optimal parallelism unchanged, simulating multiple architecture schemes by configuring different storage architecture models and interconnection parameters, and outputting a performance evaluation result.
4. The method of claim 3, wherein the process of collecting, processing and injecting the measured data comprises: controlling, through an automation script run by the upper computer, hardware behavior monitoring logic in the FPGA hardware platform to export a measured data file, and converting the measured data file into an intermediate format data file; and reading, by the simulation modeling module, the intermediate format file and updating the simulation model parameters in an initialization stage.
5. The method of claim 3, wherein in the step of constructing the unified calculation model, the storage architecture models include an on-chip explicit management memory, a local cache model and an off-chip memory model, each storage architecture model provides a unified read-write delay and throughput configuration interface to the unified calculation model, and different storage architecture models can be freely combined.
6. The method of claim 3, wherein the step of determining the optimal parallelism of the graph processor comprises gradually increasing the number of parallel calculation units while keeping the reference memory configuration unchanged, stopping the increase when the parallel acceleration efficiency of the system falls below a preset threshold or the memory bandwidth utilization reaches a preset threshold, and determining the current number as the optimal parallelism.
7. The method of claim 3, wherein the step of generating the calibrated reference simulation model comprises comparing the system-level simulation result with the hardware actual measurement result, calculating a delay correction factor, and injecting the delay correction factor into an interconnection module of the simulation model so that the error between the simulation output and the hardware actual measurement result converges to within a preset range.
8. The method of claim 3, wherein the step of simulating multiple storage architecture schemes and outputting performance evaluation results comprises: setting an architecture including only off-chip memory as a performance benchmark scheme and verifying that the error between the simulation model and the hardware measured data has converged; in an architecture incorporating on-chip explicit management memory, computing the overlap rate between calculation and transmission and determining that the data migration mechanism is valid if the overlap rate is higher than a preset threshold; and, in an architecture further incorporating a local cache model, evaluating the cache hit rate and the additional pipeline overhead, and determining that the scheme is valid only if both the performance improvement and the hit rate satisfy their thresholds.
9. The method of claim 3, further comprising an area-performance marginal benefit decision step of calculating an area efficiency ratio of each storage architecture scheme relative to a reference scheme, the area efficiency ratio being the ratio of the percentage performance improvement to the percentage predicted area increase, and determining, by the system according to a user policy, the scheme with the highest area efficiency ratio as the optimal energy efficiency architecture.
10. The method of claim 3, wherein the method is applied in a system-on-chip design flow to evaluate the matching and performance of a graph processor with an on-chip interconnect bus and memory architecture prior to chip tape-out.
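
The stop rule of claim 6 can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `Sample`, `find_optimal_parallelism` and `toy_measure`, the threshold values, and the bandwidth-bound toy model are all assumptions made for illustration. The claim's "current number" is read here as the unit count at which a stop condition first fires; a more conservative reading would return the previous count.

```python
from typing import Callable, NamedTuple


class Sample(NamedTuple):
    throughput: float      # work completed per unit time with n units
    bandwidth_util: float  # fraction of peak memory bandwidth in use (0..1)


def find_optimal_parallelism(
    measure: Callable[[int], Sample],
    max_units: int,
    min_efficiency: float,
    max_bandwidth_util: float,
) -> int:
    """Grow the unit count until one of the two stop conditions fires."""
    base = measure(1)
    for n in range(2, max_units + 1):
        sample = measure(n)
        # Parallel acceleration efficiency: speedup over one unit, divided by n.
        efficiency = sample.throughput / (n * base.throughput)
        if efficiency < min_efficiency or sample.bandwidth_util >= max_bandwidth_util:
            return n  # "the current number" at the moment the sweep stops
    return max_units  # no condition fired within the explored range


def toy_measure(n: int) -> Sample:
    """Hypothetical bandwidth-bound model: each unit needs 1/8 of peak bandwidth."""
    demand = 0.125 * n
    throughput = float(n) if demand <= 1.0 else 8.0
    return Sample(throughput, min(1.0, demand))


if __name__ == "__main__":
    print(find_optimal_parallelism(toy_measure, 16, 0.7, 0.75))
```

With the toy model and a bandwidth threshold of 0.75, the sweep stops at six units, because utilization reaches the threshold there while efficiency is still 1.0. In the described method, `measure` would instead be backed by the initial constraint model with the reference memory configuration held fixed.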

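The area efficiency ratio of claim 9 is the percentage performance improvement divided by the percentage predicted area increase, both taken relative to the reference scheme. The Python sketch below works through that arithmetic under stated assumptions: the `Scheme` type, the function names and all numeric values are hypothetical, and the user policy mentioned in the claim is not modeled.

```python
from typing import Iterable, NamedTuple


class Scheme(NamedTuple):
    name: str
    performance: float     # any higher-is-better metric, e.g. throughput
    predicted_area: float  # predicted area in any consistent unit


def area_efficiency_ratio(candidate: Scheme, reference: Scheme) -> float:
    """Percent performance gain divided by percent predicted area growth."""
    perf_gain_pct = (
        100.0 * (candidate.performance - reference.performance) / reference.performance
    )
    area_growth_pct = (
        100.0 * (candidate.predicted_area - reference.predicted_area)
        / reference.predicted_area
    )
    if area_growth_pct <= 0.0:
        # The ratio is only meaningful when the candidate adds area.
        raise ValueError("candidate must have a larger predicted area than the reference")
    return perf_gain_pct / area_growth_pct


def pick_best(candidates: Iterable[Scheme], reference: Scheme) -> Scheme:
    """Return the candidate with the highest area efficiency ratio."""
    return max(candidates, key=lambda s: area_efficiency_ratio(s, reference))


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to make the arithmetic easy to follow.
    reference = Scheme("dram_only", 100.0, 10.0)
    candidates = [
        Scheme("dram_spm", 140.0, 12.0),        # +40% performance, +20% area
        Scheme("dram_spm_cache", 150.0, 15.0),  # +50% performance, +50% area
    ]
    for scheme in candidates:
        print(scheme.name, area_efficiency_ratio(scheme, reference))
    print("best:", pick_best(candidates, reference).name)
```

In this made-up example the scheme with the larger absolute speedup is not the one selected, because each additional percent of area buys less performance.
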
Description

Graph processor modeling and analyzing method and device

Technical Field

The invention relates to the technical field of computer architecture aided design, and in particular to a graph processor modeling and analyzing method and device that combine system-level software simulation with FPGA hardware actual measurement.

Background

With the development of artificial intelligence and big data analysis, graph processing is increasingly widely used. Graph-structured data is large in scale, highly sparse and irregularly accessed, which poses severe challenges to the memory bandwidth and access delay of the processor. To improve efficiency, multi-storage architectures that include DRAM, on-chip explicit management memory (SPM) and a local cache model (Local Cache) have become a dominant trend in graph processor architecture design.

However, the prior art has significant shortcomings in modeling and evaluating such multi-storage architectures. In terms of devices, existing modeling devices fall largely into two categories, each with limitations. The first is the pure software simulation device based on a general-purpose server: it is highly flexible, but because it lacks real hardware constraints, its model parameters are often set by experience, so its simulation of micro-architecture behaviors such as bus contention and burst transmission is severely distorted. The second is the verification device based on a pure FPGA development board: although it can acquire real performance data, FPGA synthesis and place-and-route take an extremely long time, which makes it difficult to quickly traverse and evaluate hundreds of storage hierarchy combinations.
At present, there is no integrated device that physically connects an upper computer simulation environment with an FPGA hardware environment and realizes an automatic closed-loop flow of data.

In terms of methods, existing evaluation methods typically couple the calculation model deeply with the storage structure. When different storage subsystems need to be evaluated, it is often necessary to modify the code of the compute core or redesign the Register Transfer Level (RTL) logic, which makes a fair performance comparison under a uniform calculation benchmark impossible.

Therefore, there is a need for a graph processor modeling and analysis method and apparatus that combines the flexibility of system-level simulation with the accuracy of FPGA measurements.

Disclosure of Invention

Aiming at the problems of insufficient system-level simulation precision, difficulty in reasonably determining the parallel calculation scale, and low evaluation efficiency for multi-storage architectures in the existing graph processor architecture design process, the invention aims to provide a graph processor modeling and analyzing method and device. By constructing a closed-loop cooperative mechanism between a system-level simulation environment and an FPGA hardware actual measurement platform, the invention introduces data constraint and model calibration means based on real hardware behavior, which preserves modeling flexibility while significantly improving the accuracy of the evaluation result, thereby providing a high-confidence quantitative decision basis for the architecture design of a graph processor and its storage subsystem.

To achieve this purpose, the invention provides the following technical solutions in two aspects, the device and the method.

The invention provides a graph processor modeling and analyzing device, which comprises an upper computer, an FPGA hardware platform and a communication interface component.
The upper computer is configured to run system-level simulation modeling software and hardware control software, and is used for constructing a unified calculation model of the graph processor, controlling the hardware running process, and processing and analyzing the measured data. The FPGA hardware platform is configured to deploy an RTL model of the graph processor and generates real memory access and control behaviors when running a graph calculation task. The communication interface component is physically connected to the upper computer and the FPGA hardware platform and is used for transmitting control instructions and reading back the hardware measured data, thereby establishing a data interaction channel between the system-level simulation environment and the hardware actual measurement environment. Further, the upper computer comprises a simulation modeling module, a hardware actual measurement control module and a data interaction interface module. The simulation modeling module is used for constructing a unifi