US-12619459-B2 - Extending parallel software threads

Abstract

A method for executing a software program, comprising: identifying in a program a plurality of host threads, each for performing some of a plurality of parallel sub-tasks of a task; and for each of the host threads: generating device threads, each associated with the host thread, each for one of the parallel tasks associated thereof; generating a parent thread associated with the host thread for communicating with the device threads; configuring a host processing circuitry to execute the parent thread; and configuring at least one other processing circuitry to execute in parallel the device threads while the host processing circuitry executes the parent thread; and for at least one of the host threads: receiving by the parent thread a value from the at least one other processing circuitry, the value generated when executing at least one of the device threads associated with the at least one host thread.

Inventors

  • Elad RAZ
  • ILAN TAYARI
  • Dan SHECHTER

Assignees

  • NEXT SILICON LTD

Dates

Publication Date
2026-05-05
Application Date
2022-01-14

Claims (18)

  1. A method for executing a software program, comprising: identifying in a software program a plurality of host threads, each for performing some of a plurality of parallel sub-tasks of a task of the software program; and for each of the plurality of host threads: generating a plurality of device threads, each associated with the host thread, each for one of the respective some of the plurality of parallel tasks associated thereof; generating a parent thread having reduced functionality compared to the host thread, wherein the parent thread is associated with the host thread for communicating with the plurality of device threads, wherein the parent thread is generated by removing from the host thread the respective some of the plurality of parallel tasks of each of the plurality of device threads, and wherein the parent thread preserves coherency of the software program on a host processing circuitry; configuring a host processing circuitry to execute the parent thread; and configuring at least one other processing circuitry, connected to the host processing circuitry, to execute in parallel the plurality of device threads while the host processing circuitry executes the parent thread; and for at least one of the plurality of host threads: receiving by the parent thread associated with the at least one host thread at least one value from the at least one other processing circuitry, the at least one value generated by the at least one other processing circuitry executing at least one of the plurality of device threads associated with the at least one host thread.
  2. The method of claim 1, wherein at least some host threads of the plurality of host threads are identified in a source code of the software program.
  3. The method of claim 2, wherein the source code of the software program comprises a marking, indicative of at least some of the plurality of parallel sub-tasks; and wherein identifying the at least some host threads is according to the marking.
  4. The method of claim 3, wherein the marking in the source code is implemented using at least one multithreading application programming interface (API) selected from a group of multithreading APIs consisting of: OpenMP Architecture Review Board Open Multi-Processing (OpenMP) API, OpenACC, Message Passing Interface (MPI), and Intel Threading Building Blocks (TBB).
  5. The method of claim 1, wherein at least some other host threads of the plurality of host threads are identified in a binary code of the software program.
  6. The method of claim 5, wherein the binary code of the software program comprises another marking, indicative of at least one code region; wherein at least one of the plurality of host threads is configured for executing the at least one code region; and wherein identifying the at least some other host threads is according to the other marking.
  7. A system for executing a software program, comprising at least one hardware processor adapted for: identifying in a software program a plurality of host threads, each for performing some of a plurality of parallel sub-tasks of a task of the software program; and for each of the plurality of host threads: generating a plurality of device threads, each associated with the host thread, each for one of the respective some of the plurality of parallel tasks associated thereof; generating a parent thread having reduced functionality compared to the host thread, wherein the parent thread is associated with the host thread for communicating with the plurality of device threads, wherein the parent thread is generated by removing from the host thread the respective some of the plurality of parallel tasks of each of the plurality of device threads, and wherein the parent thread preserves coherency of the software program on a host processing circuitry; configuring a host processing circuitry to execute the parent thread; and configuring at least one other processing circuitry, connected to the host processing circuitry, to execute in parallel the plurality of device threads while the host processing circuitry executes the parent thread; wherein for at least one of the plurality of host threads, executing in parallel the respective plurality of device threads associated therewith comprises receiving by the parent thread associated therewith at least one value from the at least one other processing circuitry, the at least one value generated by the at least one other processing circuitry executing at least one of the plurality of device threads associated with the at least one host thread.
  8. The system of claim 7, wherein the host processing circuitry is the at least one hardware processor.
  9. The system of claim 8, wherein the at least one hardware processor is further adapted for generating the plurality of device threads, generating the parent thread, configuring the host processing circuitry and configuring the least one other processing circuitry while executing the software program.
  10. The system of claim 9, wherein the at least one hardware processor is further adapted for identifying the plurality of host threads while executing the software program.
  11. The system of claim 7, wherein the at least one other processing circuitry is an interconnected computing grid.
  12. The system of claim 11, wherein for each of the plurality of host threads configuring the at least one other processing circuitry comprises, for each of the plurality of device threads associated with the host thread: generating a dataflow graph; and projecting the dataflow graph on part of the interconnected computing grid.
  13. The system of claim 12, wherein the interconnected computing grid comprises a plurality of reconfigurable logical elements; and wherein projecting the dataflow graph on part of the interconnected computing grid comprises reconfiguring at least some of the reconfigurable logical elements.
  14. The system of claim 7, further comprising at least one shared computing resource connected to the host processing circuitry; and wherein the host processing circuitry is further configured for accessing the at least one shared computing resource in response to the at least one parent thread receiving the at least one value from the at least one other processing circuitry.
  15. The system of claim 14, wherein the at least one shared computing resource is selected from a group of computing resources consisting of: a memory area, a non-volatile storage, a co-processor, a digital communication network interface, a monitor, and an input device.
  16. The system of claim 7, wherein the at least one hardware processor is further adapted for accessing a plurality of statistical values collected while executing the software program; wherein the plurality of device threads is generated further according to the plurality of statistical values.
  17. The system of claim 16, wherein the software program comprises a plurality of telemetry instructions for collecting at least some of the plurality of statistical values.
  18. The system of claim 16, wherein at least one of the plurality of statistical values is selected from the group of statistical values consisting of: a statistical value indicative of a value of a program counter, a statistical value indicative of executing a loop, a statistical value indicative of invoking a first execution block from a second execution block, a statistical value indicative of a data value, for example a frequently used data value or a variable data value, a statistical value indicative of a range of data values, a statistical value indicative of a pattern of a plurality of data values, a statistical value indicative of memory utilization, and a statistical value indicative of a bandwidth of a plurality of memory accesses.

Description

FIELD AND BACKGROUND OF THE INVENTION

Some embodiments described in the present disclosure relate to a computerized system and, more specifically, but not exclusively, to a computerized system with a plurality of processing circuitries.

In the field of computer science, a thread of execution is a sequence of computer instructions that can be managed independently by a scheduler. A process is an instance of a software program that is being executed. The term multithreading, as used herewithin, refers to a model of program execution that allows for multiple threads to be created within a process, executing independently but concurrently sharing resources of the process. Some examples of a resource of the process are computing circuitry, memory address space, cache memory address translation buffers, values of some global variables and dynamically allocated variables, and some access privileges. Multithreading aims to increase utilization of a single processing core, for example a central processing unit (CPU) or a single core in a multi-core processor, by interleaving execution of a plurality of threads on the single processing core. Thus, a multithreaded software program is a software program that is executed in a process comprising a plurality of threads, where the threads execute concurrently, sharing resources of the process.

The term multiprocessing, in computer science, refers to concurrent processing on multiple processing cores, for example a multi-core processor or a plurality of hardware processing circuitries. Multiprocessing may apply both to executing multiple processes in parallel on multiple processing cores, and to executing multiple threads of a single process in parallel on multiple processing cores. The current disclosure focuses on executing multiple threads of a single process in parallel on multiple processing cores, but is not limited thereto.
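As an illustration of the multithreading model described above, the following minimal Python sketch shows several threads created within one process that all update the same variable in the process's memory. It is not part of the disclosure: the names `run_shared_counter`, `num_threads` and `increments` are hypothetical, and the lock is one ordinary way to keep concurrent updates to a shared resource consistent.

```python
import threading


def run_shared_counter(num_threads: int, increments: int) -> int:
    """Start num_threads threads that all update one shared counter."""
    state = {"counter": 0}      # lives in the process's memory, visible to every thread
    lock = threading.Lock()     # serializes updates to the shared value

    def worker() -> None:
        for _ in range(increments):
            with lock:
                state["counter"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                # wait until every thread has finished
    return state["counter"]


if __name__ == "__main__":
    print(run_shared_counter(num_threads=4, increments=1000))
```

Every thread reads and writes the same `state` dictionary because threads of one process share its address space; separate processes would each hold their own copy.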
Some modern programming models combine multithreading with multiprocessing to increase performance of a system executing a software program, for example to increase an amount of tasks performed by the system (throughput) and additionally or alternatively to reduce an amount of time it takes the system to perform a task (latency).

SUMMARY OF THE INVENTION

It is an object of some embodiments of the present disclosure to describe a system and a method for executing a software program where the software program has a task comprising a plurality of parallel sub-tasks and where executing the software program comprises executing a plurality of parallel host threads, each for performing some of the plurality of parallel sub-tasks. In some such embodiments, for each of the plurality of parallel host threads a plurality of device threads are each executed on one of one or more processing circuitries for performing one of the respective plurality of parallel sub-tasks of the host thread, and a parent thread associated with the host thread is executed on a host processing circuitry for communicating with the plurality of device threads.

The foregoing and other objects are achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.
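The arrangement summarized above, in which a host thread is split into a parent thread and a plurality of device threads, can be loosely simulated in ordinary software. The sketch below is only an analogy under stated assumptions, not the disclosed implementation: a Python thread pool stands in for the other processing circuitry, a queue stands in for the connection over which values are returned, and the names `device_thread`, `parent_thread` and `run_host_thread`, as well as the sum-of-squares sub-task, are hypothetical.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def device_thread(sub_task_id: int, data: list, results: queue.Queue) -> None:
    """Perform one parallel sub-task and send the resulting value to the parent."""
    results.put((sub_task_id, sum(x * x for x in data)))


def parent_thread(expected: int, results: queue.Queue, received: dict) -> None:
    """Reduced host thread: performs no sub-task itself, only receives values."""
    for _ in range(expected):
        sub_task_id, value = results.get()   # blocks until a device thread reports
        received[sub_task_id] = value


def run_host_thread(chunks: list) -> dict:
    """Split one host thread's sub-tasks into device threads plus a parent thread."""
    results: queue.Queue = queue.Queue()
    received: dict = {}
    parent = threading.Thread(
        target=parent_thread, args=(len(chunks), results, received)
    )
    parent.start()                           # the parent runs while the devices work
    # Assumption: the pool simulates the "other processing circuitry".
    with ThreadPoolExecutor(max_workers=max(1, len(chunks))) as device:
        for i, chunk in enumerate(chunks):
            device.submit(device_thread, i, chunk, results)
    parent.join()
    return received


if __name__ == "__main__":
    print(run_host_thread([[1, 2], [3, 4], [5]]))
```

The point of the sketch is the division of roles: the sub-task work is removed from the thread that stays on the host, which keeps only the communication with its device threads.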
According to a first aspect, a method for executing a software program comprises: identifying in a software program a plurality of host threads, each for performing some of a plurality of parallel sub-tasks of a task of the software program; and for each of the plurality of host threads: generating a plurality of device threads, each associated with the host thread, each for one of the respective some of the plurality of parallel tasks associated thereof; generating a parent thread associated with the host thread for communicating with the plurality of device threads; configuring a host processing circuitry to execute the parent thread; and configuring at least one other processing circuitry, connected to the host processing circuitry, to execute in parallel the plurality of device threads while the host processing circuitry executes the parent thread; and for at least one of the plurality of host threads: receiving by the parent thread associated with the at least one host thread at least one value from the at least one other processing circuitry, the at least one value generated by the at least one other processing circuitry executing at least one of the plurality of device threads associated with the at least one host thread. Executing on the host processing circuitry a parent thread associated with a host thread of the software program and with a plurality of device threads executing on one or more other processing circuitries while the parent thread executes allows increasing an amount of threads executing in parallel the plurality of parallel sub-tasks of the software program beyond the amount of parallel threads supported by an operating system of the system, reducing latency of executing the plurality of sub-tasks and additionally o