
US-12619564-B2 - Primary input-output queue serving host and guest operating systems concurrently


Abstract

Systems, apparatuses, and methods for implementing a primary input/output (PIO) queue for host and guest operating systems (OS's) are disclosed. A system includes a PIO queue, one or more compute units, and a control unit. The PIO queue is able to store work commands for multiple different types of OS's, including host and guest OS's. The control unit is able to dispatch multiple work commands from multiple OS's to execute concurrently on the compute unit(s). This allows for execution of work commands by different OS's without the processing device(s) having to incur the latency of a world switch.
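
The patent publishes no source code, but the queue mechanism the abstract describes can be sketched. In the hypothetical C below, every name (pio_queue, work_cmd, dispatch_next, the compute-unit stubs) is invented for illustration, and synchronization between producers is ignored: a single queue holds commands tagged with their originating OS, and a control-unit loop launches them on free compute units, so a host command and a guest command can be in flight at the same time.

/* Illustrative sketch only: the patent discloses no code; all names here
 * are hypothetical. A single shared queue holds OS-tagged commands, and
 * the control unit dispatches them onto free compute units concurrently.
 * Producer/consumer synchronization is omitted for brevity. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

enum os_id { HOST_OS, GUEST_OS };

struct work_cmd {
    enum os_id origin;          /* OS that generated this command */
    const char *desc;           /* stand-in for the actual work payload */
};

#define QUEUE_DEPTH 64
#define NUM_COMPUTE_UNITS 4

struct pio_queue {
    struct work_cmd cmds[QUEUE_DEPTH];
    size_t head, tail;          /* single queue shared by all OSes */
};

static bool cu_busy[NUM_COMPUTE_UNITS];   /* stub compute-unit tracker */

/* Return a free compute unit and mark it claimed, or -1 if none. */
static int claim_free_compute_unit(void)
{
    for (int i = 0; i < NUM_COMPUTE_UNITS; i++)
        if (!cu_busy[i]) { cu_busy[i] = true; return i; }
    return -1;
}

/* Dispatch the next queued command if a compute unit is free. Commands
 * from different OSes can run concurrently, so no world switch (context
 * save/restore) is needed between a host command and a guest command. */
static bool dispatch_next(struct pio_queue *q)
{
    if (q->head == q->tail)
        return false;                     /* queue empty */
    int cu = claim_free_compute_unit();
    if (cu < 0)
        return false;                     /* criteria not met: no resources */
    struct work_cmd *cmd = &q->cmds[q->head++ % QUEUE_DEPTH];
    printf("CU%d <- %s command: %s\n", cu,
           cmd->origin == HOST_OS ? "host" : "guest", cmd->desc);
    return true;
}

int main(void)
{
    struct pio_queue q = { .head = 0, .tail = 0 };
    q.cmds[q.tail++] = (struct work_cmd){ HOST_OS,  "DMA transfer A" };
    q.cmds[q.tail++] = (struct work_cmd){ GUEST_OS, "DMA transfer B" };
    while (dispatch_next(&q))             /* both dispatch back-to-back */
        ;
    return 0;
}

Because both commands come from one shared queue and land on separate compute units, neither dispatch pays the world-switch cost of saving and restoring per-OS processor state.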

Inventors

  • Xiaojing Ma
  • Ling-Ling Wang
  • Jin Xu
  • ZengRong Huang
  • Lina Ma
  • Wei Shao
  • LingFei Shi

Assignees

  • ADVANCED MICRO DEVICES, INC.

Dates

Publication Date
2026-05-05
Application Date
2021-07-30

Claims (20)

  1. An apparatus comprising: control circuitry configured to dispatch for execution on a plurality of compute circuits in a given processor of one or more processors: a first work command retrieved from a queue, wherein the first work command is generated by a first operating system (OS); and a second work command retrieved from the queue, responsive to the second work command meeting criteria to be dispatched for execution, wherein the second work command is generated by a second OS different from the first OS; and wherein at least a portion of execution of the second command occurs concurrently with execution of the first work command in the given processor.
  2. The apparatus as recited in claim 1, wherein the first OS is a host OS, and wherein the second OS is a guest OS.
  3. The apparatus as recited in claim 1, wherein the criteria include the plurality of compute circuits comprising available resources to perform concurrent execution of commands from a plurality of operating systems.
  4. The apparatus as recited in claim 3, wherein the control circuitry is further configured to dispatch the second work command subsequent to dispatching the first work command within a first latency that is less than a second latency associated with a world switch operation.
  5. The apparatus as recited in claim 1, wherein the first work command and the second work command are direct memory access (DMA) work commands.
  6. The apparatus as recited in claim 1, further comprising a hypervisor configured to: determine whether the second work command meets the criteria to be dispatched to the one or more compute circuits without invoking a world switch operation; and wait for completion of the first work command prior to the second work command being dispatched to the one or more compute circuits, responsive to determining that the second work command does not meet the criteria.
  7. The apparatus as recited in claim 1, wherein the control circuitry is further configured to concurrently store contexts for the first OS and the second OS during concurrent execution of the first work command and the second work command.
  8. A method comprising: dispatching, by control circuitry, for execution on a plurality of compute circuits in a given processor of one or more processors: a first work command retrieved from a queue, wherein the first work command is generated by a first operating system (OS); and a second work command retrieved from the queue, responsive to the second work command meeting criteria to be dispatched for execution, wherein the second work command is generated by a second OS different from the first OS; and wherein at least a portion of execution of the second command occurs concurrently with execution of the first work command in the given processor.
  9. The method as recited in claim 8, wherein the first OS is a host OS, and wherein the second OS is a guest OS.
  10. The method as recited in claim 8, wherein the criteria include the one or more compute circuits comprising available resources to perform concurrent execution of commands from the plurality of operating systems.
  11. The method as recited in claim 10, further comprising dispatching the second work command subsequent to dispatching the first work command within a first latency that is less than a second latency associated with a world switch operation.
  12. The method as recited in claim 8, wherein the first work command and the second work command are direct memory access (DMA) work commands.
  13. The method as recited in claim 8, further comprising: determining whether the second work command meets the criteria to be dispatched to the one or more compute circuits without invoking a world switch operation; and waiting for completion of the first work command prior to dispatching the second work command to the one or more compute circuits responsive to determining that the second work command does not meet the criteria.
  14. The method as recited in claim 8, further comprising concurrently storing contexts for the first OS and the second OS during concurrent execution of the first work command and the second work command.
  15. A system comprising: one or more processors comprising a plurality of compute circuits configured to execute a plurality of operating systems; a queue for storing work commands generated by the plurality of operating systems running on the one or more processors; and control circuitry in a given processor of the one or more processors configured to dispatch for execution on the plurality of compute circuits in the given processor: a first work command retrieved from a queue, wherein the first work command is generated by a first operating system (OS); and a second work command retrieved from the queue, responsive to the second work command meeting criteria to be dispatched for execution, wherein the second work command is generated by a second OS different from the first OS; and wherein at least a portion of execution of the second command occurs concurrently with execution of the first work command in the given processor.
  16. The system as recited in claim 15, wherein the first OS is a host OS, and wherein the second OS is a guest OS.
  17. The system as recited in claim 15, wherein the criteria include the one or more compute circuits comprising available resources to perform concurrent execution of commands from the plurality of operating systems.
  18. The system as recited in claim 17, wherein the control circuitry is further configured to dispatch the second work command subsequent to dispatching the first work command within a first latency that is less than a second latency associated with a world switch operation.
  19. The system as recited in claim 15, further comprising a hypervisor configured to: determine whether the second work command meets the criteria to be dispatched to the one or more compute circuits without invoking a world switch operation; and wait for completion of the first work command prior to dispatching the second work command to the one or more compute circuits responsive to determining that the second work command does not meet the criteria.
  20. The system as recited in claim 15, wherein the control circuitry is further configured to concurrently store contexts for the first OS and the second OS during concurrent execution of the first work command and the second work command.
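
Claims 6, 13, and 19 add a fallback path: a hypervisor checks whether the second command meets the dispatch criteria and, if it does not, waits for the first command to complete before dispatching the second. A minimal hedged sketch of that decision follows, in hypothetical C; the patent defines behavior, not an API, so every name below is invented.

/* Hypothetical sketch of the fallback in claims 6, 13, and 19. If the
 * compute circuits have spare resources (claims 3/10/17), the second
 * OS's command is dispatched while the first still runs, within a
 * latency smaller than a world switch (claims 4/11/18); otherwise the
 * hypervisor serializes the two commands. All names are invented. */
#include <stdbool.h>
#include <stdio.h>

struct work_cmd { const char *desc; };

/* Stub criteria check: in hardware this would test whether the compute
 * circuits can host another OS's command concurrently. */
static bool resources_available(void) { return true; }

static void dispatch(const struct work_cmd *c)
{
    printf("dispatch: %s\n", c->desc);
}

static void wait_for_completion(const struct work_cmd *c)
{
    printf("wait on: %s\n", c->desc);
}

void hypervisor_submit(const struct work_cmd *first,
                       const struct work_cmd *second)
{
    if (resources_available()) {
        dispatch(second);            /* runs concurrently with *first */
    } else {
        wait_for_completion(first);  /* serialize: no concurrent slot */
        dispatch(second);
    }
}

int main(void)
{
    struct work_cmd a = { "host DMA command" }, b = { "guest DMA command" };
    dispatch(&a);
    hypervisor_submit(&a, &b);       /* b dispatched without a world switch */
    return 0;
}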

Description

BACKGROUND

Description of the Related Art

Virtualization has been used in data processing devices for a variety of different purposes. Generally, virtualization of a data processing device may include providing one or more privileged programs with access to a virtual machine over which the privileged program has full control, while control of the physical device is retained by a virtual machine manager (VMM). The privileged program, referred to herein as a guest, provides commands and other information targeted to the hardware expected by the guest. The VMM intercepts the commands and assigns hardware of the data processing device to execute each intercepted command. Virtualization may be implemented in software (e.g., the VMM mentioned above) without any specific hardware virtualization support in the physical machine on which the VMM and its virtual machines execute. In other implementations, the hardware of the data processing device can provide support for virtualization.

Both the VMM and the guests are executed by one or more processors included in the physical data processing device. Accordingly, switching between execution of the VMM and execution of guests occurs in the processors over time. For example, the VMM can schedule a guest for execution, and in response the hardware executes the guest VM. At various points in time, a switch from executing a guest to executing the VMM also occurs so that the VMM can retain control over the physical machine (e.g., when the guest attempts to access a peripheral device, when a new page of memory is to be allocated to the guest, or when it is time for the VMM to schedule another guest). A switch between a guest and the VMM (in either direction) is referred to for purposes of discussion as a "world switch." Generally, a world switch involves saving processor state for the guest/VMM being switched away from, and restoring processor state for the guest/VMM being switched to.
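
To make that cost concrete, the save/restore described above can be sketched in hypothetical C; the patent contains no code, and every type and function name below is invented. The claims bound the concurrent-dispatch latency as strictly less than the latency of this operation.

/* Illustrative only: the patent contains no code. This sketches the
 * save/restore cost of a world switch; the disclosure's point is that
 * concurrent dispatch from one shared queue avoids paying this latency
 * between host and guest work commands. All names are hypothetical. */
#include <stdint.h>
#include <string.h>

struct vcpu_state {
    uint64_t gprs[16];      /* general-purpose registers */
    uint64_t pc;            /* program counter */
    uint64_t page_table;    /* address-space root */
    uint64_t ctrl[8];       /* control/privilege state */
};

/* Stubs for the hardware save/restore; real implementations would copy
 * to and from the physical processor's registers. */
static void save_processor_state(struct vcpu_state *dst,
                                 const struct vcpu_state *hw)
{
    memcpy(dst, hw, sizeof(*dst));
}

static void load_processor_state(struct vcpu_state *hw,
                                 const struct vcpu_state *src)
{
    memcpy(hw, src, sizeof(*hw));
}

/* A world switch (guest <-> VMM, either direction): save the context
 * being switched away from, restore the context being switched to. */
void world_switch(struct vcpu_state *hw,
                  struct vcpu_state *from, const struct vcpu_state *to)
{
    save_processor_state(from, hw);
    load_processor_state(hw, to);
}

int main(void)
{
    struct vcpu_state hw = {0}, guest = {0}, vmm = {0};
    world_switch(&hw, &guest, &vmm);   /* guest -> VMM transition */
    return 0;
}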
BRIEF DESCRIPTION OF THE DRAWINGS

The advantages of the methods and mechanisms described herein may be better understood by referring to the following description in conjunction with the accompanying drawings, in which:

FIG. 1 is a block diagram of one implementation of a computing system.
FIG. 2 is a block diagram of virtual machines and a hypervisor in accordance with some implementations.
FIG. 3 is a block diagram of one implementation of a computing system.
FIG. 4 is a block diagram of one implementation of resources allocated for virtual machines.
FIG. 5 is a block diagram of one implementation of a computing system capable of concurrently executing multiple commands from multiple operating systems (OS's).
FIG. 6 is a generalized flow diagram illustrating one implementation of a method for a command queue serving multiple operating systems concurrently.
FIG. 7 is a generalized flow diagram illustrating one implementation of a method for performing data transfers for multiple operating systems concurrently.
FIG. 8 is a generalized flow diagram illustrating one implementation of a method for determining whether to execute commands concurrently.
FIG. 9 is a generalized flow diagram illustrating one implementation of a method for determining whether to invoke a world switch operation.

DETAILED DESCRIPTION OF IMPLEMENTATIONS

In the following description, numerous specific details are set forth to provide a thorough understanding of the methods and mechanisms presented herein. However, one having ordinary skill in the art should recognize that the various implementations may be practiced without these specific details. In some instances, well-known structures, components, signals, computer program instructions, and techniques have not been shown in detail to avoid obscuring the approaches described herein. It will be appreciated that, for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements.

Various systems, apparatuses, and methods for implementing a primary input/output (PIO) queue for host and guest operating systems (OS's) are disclosed herein. In one implementation, a system includes a PIO queue, one or more compute units, and a control unit. The PIO queue is able to store work commands for multiple different types of OS's, including host and guest OS's. The control unit is able to dispatch multiple work commands from multiple OS's to execute concurrently on the compute unit(s). This allows for execution of work commands by different OS's without the processing device(s) having to incur the latency of a world switch.

Referring now to FIG. 1, a block diagram of one implementation of a computing system 100 is shown. In one implementation, computing system 100 includes at least processors 105A-N, input/output (I/O) interfaces 120, bus 125, memory controller(s) 130, network interface 135, memory device(s) 140, display controller 150,