JP-2026076307-A - Hardware management of direct memory access commands
Abstract
[Problem] To provide a processor device and system that improve utilization of the direct memory access (DMA) engine. [Solution] In system 300, a method for hardware management of DMA transfer commands includes a first DMA engine 314 accessing a DMA transfer command and determining a first portion of the data transfer requested by the command. Transfer of the first portion of the data transfer by the first DMA engine is initiated based at least in part on the DMA transfer command. Similarly, transfer of a second portion of the data transfer by a second DMA engine is initiated based at least in part on the DMA transfer command. After the first and second portions of the data transfer have been transferred, an indicator is generated to signal completion of the data transfer requested by the DMA transfer command. [Selection Diagram] Figure 3
Inventors
- ジョセフ グレイトハウス
- ショーン ケリー
- アラン スミス
- アンソニー アサロ
- リンーリン ワン
- ミリンド ネムレカール
- ハリ テンジララ
- フェリックス クーリング
Assignees
- Advanced Micro Devices, Inc.
- ATI Technologies ULC
Dates
- Publication Date
- 2026-05-11
- Application Date
- 2026-02-12
- Priority Date
- 2021-11-01
Claims (13)
- [Claim 1] A processor device comprising: a base integrated circuit (IC) die; a plurality of processing stacked die chiplets 3D stacked on the base IC die; an inter-chip data fabric communicatively coupling the plurality of processing stacked die chiplets; and a plurality of DMA engines 3D stacked on the base IC die, each configured to perform a portion of a data transfer requested by a DMA transfer command.
- [Claim 2] The processor device of claim 1, wherein each of the plurality of DMA engines includes a single command engine that drives a plurality of transfer engines.
- [Claim 3] The processor device of claim 1, wherein each of the plurality of DMA engines is configured to receive a DMA notification indicating that the DMA transfer command has been stored in a DMA buffer in system memory.
- [Claim 4] The processor device of claim 1, wherein a first DMA engine of the plurality of DMA engines is configured to transmit a cache probe request to a cache memory communicatively coupled to a first processing stacked die chiplet, and to transfer a first portion of the data transfer based on receiving a return response indicating a cache hit in the cache memory.
- [Claim 5] The processor device of claim 4, wherein a second DMA engine of the plurality of DMA engines is configured to transmit the cache probe request to a cache memory communicatively coupled to a second processing stacked die chiplet, and to transfer a second portion of the data transfer from the owning main memory based on receiving a return response indicating a cache miss in the cache memory.
- [Claim 6] The processor device of claim 1, wherein each of the plurality of DMA engines is configured to independently determine its portion of the data transfer by interleaving the total DMA transfer size among the plurality of DMA engines.
- [Claim 7] The processor device of claim 1, further comprising a primary DMA engine configured to receive the DMA transfer command and divide the DMA transfer command into a plurality of smaller workloads.
- [Claim 8] The processor device of claim 7, wherein the primary DMA engine is configured to transmit a different workload of the plurality of smaller workloads to each of the plurality of DMA engines.
- [Claim 9] A system comprising a host processor communicatively coupled to a parallel processor multichip module, the parallel processor multichip module comprising: a base integrated circuit (IC) die; a plurality of processing stacked die chiplets 3D stacked on the base IC die; an inter-chip data fabric communicatively coupling the plurality of processing stacked die chiplets; and a plurality of DMA engines 3D stacked on the base IC die, each configured to perform a portion of a data transfer requested by a DMA transfer command.
- [Claim 10] The system of claim 9, further comprising a primary DMA engine configured to receive the DMA transfer command, divide the DMA transfer command into a plurality of smaller workloads, and transmit a different workload of the plurality of smaller workloads to each of the plurality of DMA engines.
- [Claim 11] The system of claim 9, wherein each of the plurality of DMA engines is configured to independently determine its portion of the data transfer by interleaving the total DMA transfer size among the plurality of DMA engines.
- [Claim 12] The system of claim 9, wherein a first DMA engine of the plurality of DMA engines is configured to transmit a cache probe request to a cache memory communicatively coupled to a first processing stacked die chiplet, and to transfer a first portion of the data transfer based on receiving a return response indicating a cache hit in the cache memory.
- [Claim 13] The system of claim 12, wherein a second DMA engine of the plurality of DMA engines is configured to transmit the cache probe request to a cache memory communicatively coupled to a second processing stacked die chiplet, and to transfer a second portion of the data transfer from the owning main memory based on receiving a return response indicating a cache miss in the cache memory.
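The primary-engine arrangement recited in claims 7, 8, and 10, in which one engine divides a single transfer command into smaller per-engine workloads, can be sketched as follows. This is a minimal illustration under assumed semantics, not the patented implementation; the names `TransferCommand` and `split_command` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TransferCommand:
    """Hypothetical DMA transfer command: copy `size` bytes from src to dst."""
    src: int   # source address
    dst: int   # destination address
    size: int  # bytes to transfer


def split_command(cmd: TransferCommand, num_engines: int) -> list[TransferCommand]:
    """Divide one DMA transfer command into contiguous per-engine workloads,
    as a primary DMA engine might before distributing them."""
    base, rem = divmod(cmd.size, num_engines)
    workloads, offset = [], 0
    for i in range(num_engines):
        chunk = base + (1 if i < rem else 0)  # spread any remainder evenly
        workloads.append(TransferCommand(cmd.src + offset, cmd.dst + offset, chunk))
        offset += chunk
    return workloads
```

Each resulting workload covers a disjoint, contiguous span of the original transfer, so the engines can run in parallel without coordinating with one another.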
Description
A system direct memory access (DMA) engine is a module that coordinates direct memory access transfers of data between devices in a computer system (e.g., input/output interfaces and display controllers) and memory, or between different locations within memory. The DMA engine is often located on a processor such as a central processing unit (CPU) or graphics processing unit (GPU) and receives commands from applications running on the processor. Based on the commands, the DMA engine reads data from a DMA source (e.g., a first memory buffer defined in memory) and writes data to a DMA destination (e.g., a second buffer defined in memory).

This disclosure will be better understood by referring to the accompanying drawings, and its many features and advantages will become apparent to those skilled in the art. The use of the same reference numerals in different drawings indicates similar or identical items.

- Figure 1 is a block diagram of a computing system implementing a multi-die processor, according to some embodiments.
- Figure 2 is a partial block diagram of an exemplary computing system for performing hardware management of DMA commands, according to some embodiments.
- Figure 3 is a block diagram of part of an exemplary multiprocessor computing system for performing hardware management of DMA commands, according to some embodiments.
- Figure 4 is a block diagram of an example system that implements hardware-managed partitioning of transfer commands based on cache status, according to some embodiments.
- Figure 5 is a block diagram of another example system that implements hardware-managed partitioning of transfer commands, according to some embodiments.
- Figure 6 is a flowchart illustrating a method for performing hardware-managed partitioning of DMA transfer commands, according to some embodiments.

Conventional processors include one or more direct memory access (DMA) engines for reading and writing blocks of data stored in system memory.
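The command-driven copy described above, reading from a DMA source and writing to a DMA destination, can be sketched as follows. This is a minimal illustration using a flat byte-addressable memory model assumed for clarity; the function name is hypothetical and not taken from the disclosure.

```python
def dma_copy(memory: bytearray, src: int, dst: int, size: int) -> None:
    """Copy `size` bytes from a source buffer to a destination buffer
    within one memory space, as a DMA engine would on behalf of the CPU."""
    memory[dst:dst + size] = memory[src:src + size]
```

In a real system the source and destination would be described by the DMA transfer command, and the copy would proceed without occupying the processor core.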
The DMA engines relieve the processor core of the burden of managing data transfers. In response to data transfer requests from the processor core, the DMA engines provide the necessary control information to the corresponding source and destination, enabling the data transfer to proceed without delaying computation code and thus allowing communication and computation to overlap in time. Because the formation of control information and the communication itself are handled asynchronously, the processor core is freed to perform other tasks while the data transfer request is being satisfied. Distributed architectures, in which physically or logically separated processing units work collaboratively through high-performance interconnects, are becoming increasingly common as an alternative to monolithic processing architectures. One example of a distributed architecture is the chiplet architecture, which gains the advantage of manufacturing some parts of the processing unit on smaller process nodes while allowing other parts, which do not benefit from the smaller nodes, to be manufactured on larger nodes. The number of DMA engines is likely to increase in chiplet-based systems (e.g., compared to an equivalent monolithic non-chiplet-based design). To improve system performance by making better use of these DMA engines, Figures 1 to 6 illustrate systems and methods that use hardware-managed coordination for processing DMA transfer commands. In various embodiments, a method for hardware management of DMA transfer commands includes a first DMA engine accessing the DMA transfer command and determining a first portion of the data transfer requested by the DMA transfer command. Transfer of the first portion of the data transfer by the first DMA engine is initiated based at least in part on the DMA transfer command.
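The interleaved partitioning recited in the claims, where each engine independently derives its own slice of the total transfer with no central coordinator, can be sketched as follows. The chunk granularity and all names are assumptions for illustration, not values from the disclosure.

```python
CHUNK = 256  # hypothetical interleave granularity, in bytes


def engine_portion(engine_id: int, num_engines: int, total_size: int) -> list[tuple[int, int]]:
    """Return the (offset, length) pairs handled by one DMA engine when the
    total transfer size is interleaved chunk-by-chunk among all engines.
    Engine i takes chunks i, i + num_engines, i + 2*num_engines, ..."""
    portions = []
    offset = engine_id * CHUNK
    while offset < total_size:
        portions.append((offset, min(CHUNK, total_size - offset)))
        offset += num_engines * CHUNK
    return portions
```

Because every engine computes the same deterministic schedule from the shared command, the portions are disjoint and together cover the full transfer without any engine-to-engine communication.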
Similarly, transfer of a second portion of the data transfer by a second DMA engine (i.e., a DMA engine different from the first DMA engine) is initiated based at least in part on the DMA transfer command. After the first and second portions of the data transfer have been transferred, an indicator is generated to signal completion of the data transfer requested by the DMA transfer command. In this way, the work specified by the transfer command is divided among the DMA engines, increasing total bandwidth usage without enlarging individual DMA engines or adding functions to raise overall DMA throughput or data fabric bandwidth usage. Figure 1 is a block diagram of a computing system 100 implementing a multi-die processor, according to some embodiments. In various embodiments, the computing system 100 includes one or more processors 102A-102N, a fabric 104, an input/output (I/O) interface 106, a memory controller 108, a display controller 110, and other devices 112. In various embodiments, to support the execution of instructions for graphics and other types of workloads, the computing system 100 includes a host processor 11