
US-12619463-B2 - Thread creation on local or remote compute elements by a multi-threaded, self-scheduling processor

US12619463B2

Abstract

Representative apparatus, method, and system embodiments are disclosed for a self-scheduling processor which also provides additional functionality. Representative embodiments include a self-scheduling processor, comprising: a processor core adapted to execute a received instruction; and a core control circuit adapted to automatically schedule an instruction for execution by the processor core in response to a received work descriptor data packet. In another embodiment, the core control circuit is also adapted to schedule a fiber create instruction for execution by the processor core, to reserve a predetermined amount of memory space in a thread control memory to store return arguments, and to generate one or more work descriptor data packets to another processor or hybrid threading fabric circuit for execution of a corresponding plurality of execution threads. Event processing, data path management, system calls, memory requests, and other new instructions are also disclosed.
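The scheduling flow in the abstract — a core control circuit that dispatches work to the processor core when a work descriptor data packet arrives — can be sketched as a minimal software model. All names and fields below (`WorkDescriptor`, `CoreControl`, and so on) are illustrative assumptions for the sketch, not hardware interfaces defined by the patent:

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class WorkDescriptor:
    """Illustrative stand-in for a work descriptor data packet."""
    entry_point: Callable[..., int]  # analog of an instruction address
    call_args: Tuple = ()

class CoreControl:
    """Toy model of a core control circuit: arrival of a packet is what
    triggers scheduling, with no operating-system involvement."""
    def __init__(self):
        self.pending = deque()
        self.results = []

    def receive(self, packet):
        self.pending.append(packet)   # packet arrival queues work

    def run(self):
        while self.pending:           # self-schedule each queued packet
            pkt = self.pending.popleft()
            self.results.append(pkt.entry_point(*pkt.call_args))

ctrl = CoreControl()
ctrl.receive(WorkDescriptor(lambda a, b: a + b, (2, 3)))
ctrl.receive(WorkDescriptor(lambda x: x * x, (4,)))
ctrl.run()
print(ctrl.results)  # [5, 16]
```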

Inventors

  • Tony M. Brewer

Assignees

  • MICRON TECHNOLOGY, INC.

Dates

Publication Date
2026-05-05
Application Date
2024-08-12

Claims (20)

  1. An apparatus comprising: a memory interface coupled to a memory; a data cache; an instruction cache; and one or more processing elements configured to: access data from the data cache; and execute instructions from the instruction cache to perform operations comprising: receiving a thread create instruction from a parent thread executing on the one or more processing elements, the thread create instruction indicating a number of return parameters to be generated by a child thread created by the thread create instruction; reserving, via the memory interface and based on the number of return parameters, space in the memory to store the return parameters; creating the child thread based at least in part on the reserving of the space in the memory; and providing access to the return parameters from the reserved space to the parent thread based at least in part on a thread return instruction from the child thread.
  2. The apparatus of claim 1, wherein the operations further comprise: receiving a join instruction from the parent thread; wherein the providing of the access to the return parameters from the reserved space to the parent thread is in response to the receipt of the join instruction.
  3. The apparatus of claim 2, wherein the thread create instruction further indicates a caller identifier; and the caller identifier is provided to the parent thread in response to the receipt of the join instruction.
  4. The apparatus of claim 1, wherein the thread create instruction from the parent thread uses a plurality of bits to indicate the number of return parameters.
  5. The apparatus of claim 1, wherein the reserving of the space in the memory comprises reserving a predetermined number of bits for each return parameter.
  6. The apparatus of claim 1, wherein the providing of the access to the return parameters from the reserved space to the parent thread comprises copying data from the reserved space to a register or register state of the parent thread.
  7. The apparatus of claim 1, wherein the operations further comprise: in response to the thread return instruction, checking that all operations initiated by the child thread either have been completed or have been acknowledged.
  8. The apparatus of claim 1, wherein the operations further comprise: waiting for all threads created by the child thread to complete before sending the thread return instruction from the child thread.
  9. The apparatus of claim 1, wherein the thread create instruction further indicates whether to start the child thread on a local node or a remote node.
  10. The apparatus of claim 1, wherein the creating of the child thread comprises: beginning execution of the child thread from an address read from a register of the parent thread.
  11. The apparatus of claim 1, wherein the thread create instruction further indicates a number of call arguments, and the call arguments are accessed from registers of the parent thread.
  12. The apparatus of claim 1, wherein at least one processing element of the one or more processing elements is a hybrid threading processor.
  13. A non-transitory machine-readable medium that stores a plurality of instructions that, when executed by one or more processing elements, cause the one or more processing elements to perform operations comprising: receiving a thread create instruction from a parent thread that indicates a number of return parameters to be generated by a child thread created by the thread create instruction; reserving, based on the number of return parameters, space in a memory to store the return parameters; creating the child thread based at least in part on the reserving of the space in the memory; and providing access to the return parameters from the reserved space to the parent thread based at least in part on a thread return instruction from the child thread.
  14. The non-transitory machine-readable medium of claim 13, wherein the operations further comprise: receiving a join instruction from the parent thread; wherein the providing of the access to the return parameters from the reserved space to the parent thread is in response to the receipt of the join instruction.
  15. The non-transitory machine-readable medium of claim 13, wherein the thread create instruction from the parent thread that indicates the number of return parameters uses a plurality of bits to indicate the number of return parameters and selects one of 0, 1, 2, or 4 parameters.
  16. The non-transitory machine-readable medium of claim 13, wherein the reserving of the space in the memory comprises reserving a predetermined number of bits for each return parameter.
  17. The non-transitory machine-readable medium of claim 13, wherein the providing of the return parameters from the reserved space to the parent thread comprises copying data from the reserved space to a register or register state of the parent thread.
  18. The non-transitory machine-readable medium of claim 13, wherein the operations further comprise: in response to the thread return instruction, checking that all operations initiated by the child thread have either completed or been acknowledged.
  19. A method comprising: receiving, by one or more processing elements, a thread create instruction from a parent thread that indicates a number of return parameters to be generated by a child thread created by the thread create instruction; reserving, by the one or more processing elements, based on the number of return parameters, space in a memory to store the return parameters; creating the child thread based at least in part on the reserving of the space in the memory; and providing access to the return parameters from the reserved space to the parent thread based at least in part on a thread return instruction from the child thread.
  20. The method of claim 19, further comprising: waiting for all threads created by the child thread to complete before sending the thread return instruction from the child thread.
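The create/reserve/return/join sequence recited in claims 1, 2, 5, and 15 can be modeled with ordinary software threads. The 2-bit field selecting 0, 1, 2, or 4 return parameters comes from claim 15; the 64-bit per-parameter size is an assumed value standing in for claim 5's "predetermined number of bits", and every class and method name here is illustrative rather than taken from the patent:

```python
import threading

# 2-bit return-count field as in claim 15: selects 0, 1, 2, or 4 parameters.
RETURN_COUNT = {0b00: 0, 0b01: 1, 0b10: 2, 0b11: 4}
BITS_PER_PARAM = 64  # assumed per-parameter size (claim 5's predetermined bits)

class ThreadContext:
    def __init__(self, rc_bits, target, args=()):
        n = RETURN_COUNT[rc_bits]
        # reserve space for the return parameters before the child runs
        self.reserved = [None] * n
        self.reserved_bits = n * BITS_PER_PARAM
        self._t = threading.Thread(target=self._run, args=(target, args))
        self._t.start()

    def _run(self, target, args):
        results = target(*args)  # child thread body
        # thread return: deposit return parameters into the reserved space
        for i, v in enumerate(results[:len(self.reserved)]):
            self.reserved[i] = v

    def join(self):
        # join: parent gains access to the return parameters from the
        # reserved space only after the child's thread return
        self._t.join()
        return list(self.reserved)

child = ThreadContext(0b10, lambda a, b: (a + b, a * b), (3, 4))
print(child.join())  # [7, 12]
```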

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of and claims the benefit of and priority to U.S. patent application Ser. No. 17/994,143, filed Nov. 25, 2022, inventor Tony M. Brewer, titled “Thread Creation on Local or Remote Compute Elements by a Multi-Threaded, Self-Scheduling Processor”, which is a continuation of and claims the benefit of and priority to U.S. patent application Ser. No. 16/399,817, filed Apr. 30, 2019 and issued Nov. 29, 2022 as U.S. Pat. No. 11,513,840 B2, inventor Tony M. Brewer, titled “Thread Creation on Local or Remote Compute Elements by a Multi-Threaded, Self-Scheduling Processor”, which is a nonprovisional of and claims the benefit of and priority to U.S. Provisional Patent Application No. 62/667,850, filed May 7, 2018, inventor Tony M. Brewer, titled “Thread Creation on Local or Remote Compute Elements by a Multi-Threaded, Self-Scheduling Processor”, which are commonly assigned herewith, and all of which are hereby incorporated herein by reference in their entireties with the same full force and effect as if set forth in their entireties herein (hereinafter referred to as the “related applications”).

FIELD OF THE INVENTION

The present invention, in general, relates to configurable computing circuitry and, more particularly, to a heterogeneous computing system which includes a self-scheduling processor, configurable computing circuitry with an embedded interconnection network, dynamic reconfiguration, and dynamic control over energy or power consumption.

BACKGROUND OF THE INVENTION

Many existing computing systems have reached significant limits for computation processing capabilities, in terms of speed of computation, energy (or power) consumption, and associated heat dissipation.
For example, existing computing solutions have become increasingly inadequate as the need for advanced computing technologies grows, such as to accommodate artificial intelligence and other significant computing applications. Accordingly, there is an ongoing need for a computing architecture capable of providing high performance using sparse data sets, involving limited or no data reuse, which typically cause poor cache hit rates. Such a computing architecture should be tolerant of latency to memory and allow high sustained executed instructions per clock. There is also an ongoing need for a computing architecture capable of providing high performance and energy efficient solutions for compute-intensive kernels, such as for computation of Fast Fourier Transforms (FFTs) and finite impulse response (FIR) filters used in sensing, communication, and analytic applications, such as synthetic aperture radar, 5G base stations, and graph analytic applications such as graph clustering using spectral techniques, machine learning, 5G networking algorithms, and large stencil codes, for example and without limitation. There is also an ongoing need for a processor architecture capable of significant parallel processing and further interacting with and controlling a configurable computing architecture for performance of any of these various applications. 
SUMMARY OF THE INVENTION

As discussed in greater detail below, the representative apparatus, system and method provide for a computing architecture capable of providing high performance and energy efficient solutions for compute-intensive kernels, such as for computation of Fast Fourier Transforms (FFTs) and finite impulse response (FIR) filters used in sensing, communication, and analytic applications, such as synthetic aperture radar, 5G base stations, and graph analytic applications such as graph clustering using spectral techniques, machine learning, 5G networking algorithms, and large stencil codes, for example and without limitation. As mentioned above, sparse data sets typically cause poor cache hit rates. The representative apparatus, system and method provide for a computing architecture capable of allowing some threads to be waiting for response from memory while other threads are continuing to execute instructions. This style of compute is tolerant of latency to memory and allows high sustained executed instructions per clock. Also as discussed in greater detail below, the representative apparatus, system and method provide for a processor architecture capable of self-scheduling, significant parallel processing and further interacting with and controlling a configurable computing architecture for performance of any of these various applications. A self-scheduling processor is disclosed. In a representative embodiment, the processor comprises: a processor core adapted to execute a received instruction; and a core control circuit coupled to the processor core, the core control circuit adapted to automatically schedule an instruction for execution by the processor core in response to a received work descriptor data packet. In another representative embodiment, the processor comprises: a processor core adapted to execute