CN-122019296-A - Performance monitoring and counting method and device, electronic equipment and storage medium
Abstract
The disclosure provides a performance monitoring and counting method and device, electronic equipment and a storage medium, and relates to the technical field of processor design. The device comprises a plurality of performance monitoring slave units configured in different chip partitions of a processor. Each slave unit independently counts the performance events generated by a calculation module to be subjected to performance analysis and generates a counting result. A performance monitoring master unit uniformly manages the counting result output requests of the slave units and stores the counting results from each source separately in a target cache unit through a system bus according to a preset address mapping mode. In a scenario where multiple modules run in parallel, the scheme can accurately distinguish the sources of performance events, so that the monitoring data has clear spatial boundaries and source association; write conflicts are reduced, the stability of data management is improved, and the reliability, scalability and monitoring precision of the performance monitoring system under a complex parallel architecture are improved.
Inventors
- Request for anonymity
Assignees
- 摩尔线程智能科技(北京)股份有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20251218
Claims (19)
- 1. A performance monitoring counting apparatus for use in a processor supporting multitasking parallel processing, the apparatus comprising: at least two performance monitoring slave units, respectively associated with the calculation modules to be subjected to performance analysis in the chip partitions of the processor, each performance monitoring slave unit being configured to independently count performance events generated by its associated calculation module and to generate a corresponding counting result; and a performance monitoring master unit, communicatively connected to the at least two performance monitoring slave units and configured to receive and manage counting result output requests from the at least two performance monitoring slave units and, based on the counting result output requests, to store the counting results in a target cache unit through a system bus.
- 2. The apparatus of claim 1, wherein each of the performance monitoring slave units comprises: at least one performance event accumulator, configured to count at least one input performance event signal in parallel, each performance event corresponding to one performance event accumulator; a control signal processor, configured to generate a start counting command, a stop counting command and a dump command according to a received control signal; and an output buffer, configured to buffer the current counting result of the at least one performance event accumulator when the dump command is triggered.
- 3. The apparatus according to claim 1 or 2, wherein the performance monitoring counting apparatus further comprises: a central control signal generator, communicatively connected to the control signal processor and configured to generate and broadcast a unified global control signal; wherein the control signal received by the control signal processor comprises the global control signal from the central control signal generator and/or control information generated based on a local register configuration.
- 4. The apparatus of claim 3, wherein the central control signal generator outputs at least two global control signals, the at least two global control signals including an enable signal and a command signal; and the control signal processor generates a start counting command or a stop counting command by detecting an edge of the enable signal, and generates a dump command by detecting an edge of the command signal.
- 5. The apparatus of claim 3, wherein the operating modes of the control signal processor include an event trigger mode and an interval trigger mode; in the event trigger mode, the dump command is triggered by a local register configuration; in the interval trigger mode, the dump command is triggered automatically according to a configured interval period.
- 6. The apparatus of claim 5, wherein when the control signal processor receives a global control signal from the central control signal generator, the trigger mode of the local register configuration is ignored and a command is generated directly from the global control signal.
- 7. The apparatus of claim 1, wherein the performance monitoring master unit includes an arbiter configured to perform round-robin arbitration on write requests from the at least two performance monitoring slave units, to obtain an arbitration result including a write order of the count results, and to write the count results to the target cache unit over the system bus based on the arbitration result.
- 8. The apparatus of claim 1, wherein the performance monitoring master unit is configured to write the count results of different performance monitoring slave units into mutually non-overlapping address areas in the target cache unit according to a predetermined address mapping manner, so as to achieve differentiated storage.
- 9. The apparatus according to claim 1 or 8, wherein, for the same performance monitoring slave unit, the performance monitoring master unit writes the count results dumped at different times into address areas in the target cache unit contiguously or at a fixed step interval according to a predetermined address increment rule, so as to achieve differentiated storage of different batches of data from the same unit.
- 10. The apparatus of claim 1 or 2, wherein each of the performance monitoring slave units further comprises: a dump address calculator, configured to dynamically calculate the target memory address corresponding to each dump operation according to a configured base address, an initial address offset and a dump step length.
- 11. The apparatus of claim 10, wherein the dump address calculator is further configured to apply a wrapping mode or a stop mode when the calculated target memory address is outside of the configured memory space; in the wrapping mode, the target memory address wraps around to the configured base address and dumping continues; in the stop mode, counting is stopped and an address overflow error is reported.
- 12. The apparatus of claim 2, wherein the count mode of the performance event accumulator comprises a relative count mode or an absolute count mode; in the relative count mode, the count result of the performance event accumulator is cleared after each dump operation is completed; in the absolute count mode, the count result of the performance event accumulator is retained after each dump and is cleared only once the stop counting command is received.
- 13. The apparatus of claim 1, wherein the performance monitoring slave unit is further configured to generate dump pointer information, the dump pointer information including a number of dumps and a last dump completion flag; the performance monitoring master unit writes the dump pointer information into the target cache unit together with or separately from the count result.
- 14. The apparatus of claim 13, wherein the performance monitoring slave unit triggers one write of the dump pointer information each time a specified number of dump operations has been completed, according to a configured pointer update frequency, and/or triggers one write of the dump pointer information after receiving the stop counting command and completing the last dump operation.
- 15. The apparatus of claim 1, wherein the performance monitoring slave unit further comprises: a state reporting module, configured to write a running state and an error state into a state register; wherein the running state and the error state include one or more of a last dump complete flag, a dump command overlap flag, a counter overflow flag, and an address space overflow flag.
- 16. The apparatus of claim 1, wherein the performance monitoring slave unit receives configuration information via a configuration register; the configuration information includes one or more of an enable bit, a count mode bit, an address overflow handling mode bit, an interval trigger period, a pointer update frequency, a memory base address, a memory space size, a starting address offset, and a dump step size.
- 17. A performance monitoring and counting method, applied to a processor supporting multitasking parallel processing, the processor comprising a performance monitoring and counting device according to any one of claims 1-16, the performance monitoring and counting device comprising at least two performance monitoring slave units and a performance monitoring master unit for uniformly managing all the performance monitoring slave units, the method comprising: independently counting, by each performance monitoring slave unit, the performance events generated when its associated calculation module to be subjected to performance analysis runs, and generating corresponding counting results; and, in response to the counting result output requests from the performance monitoring slave units, uniformly managing the requests by the performance monitoring master unit and, based on the requests, storing the counting results from different performance monitoring slave units in a target cache unit through a system bus according to a preset address mapping mode.
- 18. An electronic device, comprising: a processor; and a memory having stored thereon computer readable instructions which, when executed by the processor, implement the performance monitoring counting method of claim 17.
- 19. A computer readable storage medium, having stored thereon a computer program which when executed by a processor implements the performance monitoring counting method of claim 17.
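The slave-unit behavior recited in claims 10 to 12 (dump address calculation, the wrapping and stop modes, and the relative and absolute count modes) can be sketched in software as follows. This is only an illustrative model under stated assumptions, not the claimed hardware: the class and method names (`DumpAddressCalculator`, `SlaveUnit`, `record`, `dump`) are hypothetical, one dump record is assumed to occupy exactly one dump step, and handling of the stop counting command is omitted.

```python
from enum import Enum


class OverflowMode(Enum):
    WRAP = "wrap"  # wrap back to the base address and keep dumping
    STOP = "stop"  # stop counting and report an address overflow error


class CountMode(Enum):
    RELATIVE = "relative"  # accumulators are cleared after every dump
    ABSOLUTE = "absolute"  # accumulators keep their value across dumps


class DumpAddressCalculator:
    """Derives each dump's target address from base, start offset and step."""

    def __init__(self, base, size, start_offset, step, mode=OverflowMode.WRAP):
        self.base, self.size, self.step, self.mode = base, size, step, mode
        self.next_addr = base + start_offset
        self.overflow_error = False  # models the reported address overflow error

    def next_address(self):
        """Return the next target address, or None once stop mode has tripped."""
        if self.overflow_error:
            return None
        # Assumption: one dump record occupies exactly `step` bytes, so the
        # address is "outside the space" when the record would not fit.
        if self.next_addr + self.step > self.base + self.size:
            if self.mode is OverflowMode.STOP:
                self.overflow_error = True
                return None
            self.next_addr = self.base  # wrapping mode: restart at the base
        addr = self.next_addr
        self.next_addr += self.step
        return addr


class SlaveUnit:
    """One accumulator per performance event, plus a dump path."""

    def __init__(self, num_events, calculator, count_mode=CountMode.RELATIVE):
        self.counters = [0] * num_events
        self.calculator = calculator
        self.count_mode = count_mode
        self.running = False

    def start(self):
        self.running = True

    def record(self, event_index, amount=1):
        # Events are only accumulated while counting is enabled.
        if self.running:
            self.counters[event_index] += amount

    def dump(self):
        """Return (target_address, snapshot), or None on an address overflow."""
        addr = self.calculator.next_address()
        if addr is None:
            self.running = False  # stop mode: counting stops
            return None
        snapshot = list(self.counters)  # plays the role of the output buffer
        if self.count_mode is CountMode.RELATIVE:
            self.counters = [0] * len(self.counters)
        return addr, snapshot


# Example: two events, a 0x30-byte space starting at 0x1000, 0x10-byte steps.
unit = SlaveUnit(2, DumpAddressCalculator(0x1000, 0x30, 0x10, 0x10))
unit.start()
unit.record(0)
unit.record(1, 3)
first_dump = unit.dump()
```

In relative count mode each snapshot holds only the events seen since the previous dump, which suits interval sampling; absolute count mode yields running totals instead.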
Description
Performance monitoring and counting method and device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of processor design, in particular to a performance monitoring and counting method and device, electronic equipment and a storage medium.
Background
In a multi-task parallel operation scenario, a graphics processor (Graphics Processing Unit, GPU) generally needs to count the various types of performance events generated during processing in order to evaluate how chip resources are used, and thereby help software identify performance bottlenecks and optimize scheduling policies. At present, the related art commonly adopts a shared performance counting mechanism, in which one set of global counters accumulates the performance events generated by multiple modules inside the processor. However, as processor parallelism increases and task types diversify, the shared performance counter scheme increasingly suffers from mixed data sources: when multiple tasks or multiple computing modules run simultaneously, the global counters cannot distinguish performance events from different sources, so the performance data is difficult to attribute clearly and cannot reflect the actual running behavior of each independent task or chip partition. In addition, in application scenarios that require performance data to be collected at a higher frequency or at a finer granularity, the related schemes struggle to balance system overhead against monitoring accuracy in terms of counting timeliness, result management and the data writing mechanism. Therefore, how to accurately distinguish performance events from different sources and reasonably manage the statistical results under a multi-task parallel architecture has become a technical problem to be solved in the field.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
An object of the embodiments of the present disclosure is to provide a performance monitoring and counting method, a performance monitoring and counting device, an electronic device, and a computer readable storage medium, so as to reduce write conflicts, improve the stability of data management, and improve the reliability, scalability, and monitoring precision of a performance monitoring system under a complex parallel architecture. Other features and advantages of the present disclosure will be apparent from the following detailed description, or may be learned in part by the practice of the disclosure. According to a first aspect of embodiments of the present disclosure, there is provided a performance monitoring and counting apparatus for use in a processor supporting multitasking parallel processing, the apparatus comprising: at least two performance monitoring slave units, respectively associated with the calculation modules to be subjected to performance analysis in the chip partitions of the processor, each performance monitoring slave unit being configured to independently count performance events generated by its associated calculation module and to generate a corresponding counting result; and a performance monitoring master unit, communicatively connected to the at least two performance monitoring slave units and configured to receive and manage counting result output requests from the at least two performance monitoring slave units and, based on the counting result output requests, to store the counting results in a target cache unit through a system bus.
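As a rough software illustration of the master unit's role, the sketch below combines the round-robin arbitration of claim 7 with the non-overlapping per-slave address areas of claim 8 and the fixed-step placement of successive dumps from claim 9. Everything here is an assumption made for illustration: the names `MasterUnit`, `region_size` and `record_size` are hypothetical, a Python dictionary stands in for the target cache unit and the system bus, and the layout (one equally sized region per slave, indexed by slave id) is just one possible preset address mapping.

```python
from collections import deque


class MasterUnit:
    """Arbitrates dump requests and maps each slave to its own address area."""

    def __init__(self, num_slaves, region_size, record_size, cache_base=0):
        self.num_slaves = num_slaves
        self.region_size = region_size  # bytes reserved per slave (assumed)
        self.record_size = record_size  # bytes per dump record (assumed)
        self.cache_base = cache_base
        self.pending = [deque() for _ in range(num_slaves)]
        self.next_offset = [0] * num_slaves
        # Start so that slave 0 is the first candidate to be granted.
        self.last_granted = num_slaves - 1
        self.cache = {}  # address -> (slave_id, counts); stand-in for the cache

    def region_base(self, slave_id):
        # Regions of different slaves never overlap because each starts at a
        # multiple of region_size and writes stay inside the region.
        return self.cache_base + slave_id * self.region_size

    def request(self, slave_id, counts):
        """Queue a counting result output request from one slave unit."""
        self.pending[slave_id].append(list(counts))

    def arbitrate(self):
        """Round-robin: grant the first requester after the last one served."""
        for i in range(1, self.num_slaves + 1):
            slave_id = (self.last_granted + i) % self.num_slaves
            if self.pending[slave_id]:
                self.last_granted = slave_id
                return slave_id
        return None

    def service_one(self):
        """Write one granted result; return (slave_id, address) or None."""
        slave_id = self.arbitrate()
        if slave_id is None:
            return None
        if self.next_offset[slave_id] + self.record_size > self.region_size:
            raise OverflowError(f"region of slave {slave_id} is full")
        counts = self.pending[slave_id].popleft()
        addr = self.region_base(slave_id) + self.next_offset[slave_id]
        self.cache[addr] = (slave_id, counts)
        # Successive dumps from the same slave land at a fixed step.
        self.next_offset[slave_id] += self.record_size
        return slave_id, addr


# Example: three slaves, 0x100 bytes each, 0x20-byte records.
master = MasterUnit(num_slaves=3, region_size=0x100, record_size=0x20)
master.request(0, [5, 1])
master.request(2, [7, 0])
master.request(0, [2, 2])
grants = [master.service_one() for _ in range(3)]
```

Because every write address is derived from the slave id, the source of each stored result can be recovered from its address alone, which is the source attribution the shared-counter scheme lacks.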
In some example embodiments of the present disclosure, based on the foregoing aspects, each of the performance monitoring slave units includes: at least one performance event accumulator, configured to count at least one input performance event signal in parallel, each performance event corresponding to one performance event accumulator; a control signal processor, configured to generate a start counting command, a stop counting command and a dump command according to a received control signal; and an output buffer, configured to buffer the current counting result of the at least one performance event accumulator when the dump command is triggered. In some example embodiments of the present disclosure, based on the foregoing aspect, the performance monitoring counting device further includes: a central control signal generator, communicatively connected to the control signal processor and configured to generate and broadcast a unified global control s