
US-12619378-B2 - Optimization of memory use for efficient neural network execution


Abstract

Disclosed implementations describe methods, and systems to perform the methods, of optimizing the size of memory used for accumulation of neural node outputs and for supporting multiple computational paths in neural networks. In one example, the size of memory used to perform neural layer computations is reduced by performing nodal computations in multiple batches, followed by rescaling and accumulation of the nodal outputs. In another example, execution of parallel branches of neural node computations includes evaluating, prior to the actual execution, the amount of memory resources needed to execute a particular order of branches sequentially, and selecting the order that minimizes this amount or keeps it below a target threshold.
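
A minimal sketch of the first example (batched computation with rescaling and accumulation) is shown below. Everything in it is illustrative: the function name, buffer sizes, and number formats are assumptions, not details taken from the disclosure. The idea is that a layer's outputs are computed one batch at a time into a small full-precision scratch buffer that is overwritten on every pass, and each batch is rescaled into a narrower integer output format, with one scaling factor retained per batch.

```python
# Minimal sketch (assumed names and formats, not the patent's implementation):
# compute a layer's outputs in batches so only a small full-precision scratch
# buffer is needed; rescale each batch into a narrower integer format and keep
# one scaling factor per batch.
import numpy as np

def batched_layer(inputs, weights, num_batches, out_dtype=np.int8):
    out_dim = weights.shape[0]
    batch = out_dim // num_batches                 # nodes computed per pass (assumes even split)
    output = np.empty(out_dim, dtype=out_dtype)    # output buffer (narrow format)
    scales = np.empty(num_batches)                 # scaling-factor buffer, one entry per batch
    scratch = np.empty(batch, dtype=np.int64)      # scratch buffer, overwritten every pass
    lim = np.iinfo(out_dtype).max
    for b in range(num_batches):
        rows = slice(b * batch, (b + 1) * batch)
        # full-precision accumulation for this batch only
        scratch[:] = weights[rows].astype(np.int64) @ inputs.astype(np.int64)
        # scaling factor chosen from the largest accumulated value in the batch
        scales[b] = max(np.abs(scratch).max() / lim, 1.0)
        output[rows] = np.clip(np.round(scratch / scales[b]), -lim - 1, lim).astype(out_dtype)
    return output, scales

# Example: 64 inputs, 96 output nodes, computed in 3 batches of 32 nodes each.
rng = np.random.default_rng(0)
out, scales = batched_layer(rng.integers(-128, 128, 64, dtype=np.int8),
                            rng.integers(-128, 128, (96, 64), dtype=np.int8),
                            num_batches=3)
```

Because every batch reuses the same scratch buffer, the full-precision footprint stays at one batch's worth of accumulators regardless of the layer's width.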

Inventors

  • Kaiping Li
  • Aidan Smyth
  • Ashutosh Pandey

Assignees

  • CYPRESS SEMICONDUCTOR CORPORATION

Dates

Publication Date
2026-05-05
Application Date
2022-03-25

Claims (20)

  1. A system comprising: a memory subsystem comprising: a first buffer and a second buffer; and a processing device communicatively coupled to the memory subsystem, the processing device to perform a sequential execution of a plurality of branches (PBs) of a neural network, each of the PBs connecting a branching node and an aggregation node, wherein at least one of the PBs comprises one or more intermediate nodes, and wherein to perform the sequential execution of the PBs, the processing device is to: store, in the first buffer, an output of one or more odd-numbered nodes of a first branch of the PBs; store, in the second buffer, an output of one or more even-numbered nodes of the first branch of the PBs; and wherein an order of the sequential execution of the PBs is determined in view of a combined minimum size of the first buffer and the second buffer supporting the sequential processing of the PBs.
  2. The system of claim 1, wherein to perform the sequential execution of the PBs, the processing device is further to: store, in the first buffer or the second buffer, an output of one or more odd-numbered nodes of a second branch of the PBs.
  3. The system of claim 1, wherein the memory subsystem further comprises a first scratch buffer, and wherein to perform the sequential execution of the PBs, the processing device is to: store, in the first buffer, an output of the branching node; and store, in the first scratch buffer, a copy of the output of the branching node.
  4. The system of claim 3, wherein the memory subsystem further comprises a second scratch buffer, and wherein to perform the sequential execution of the PBs, the processing device is further to: store, in the second scratch buffer, an output of a last intermediate node of the first branch of the PBs.
  5. The system of claim 4, wherein to perform the sequential execution of the PBs, the processing device is further to: load, from the first scratch buffer, an input into a first intermediate node of the second branch of the PBs; and load, from the second scratch buffer, at least a portion of an input into the aggregation node.
  6. The system of claim 3, wherein the memory subsystem further comprises a second scratch buffer, and wherein to perform the sequential execution of the PBs, the processing device is further to: interrupt execution of the first branch of the PBs; store, in the second scratch buffer, an intermediate output of the interrupted first branch of the PBs; execute, using the output of the branching node stored in the first scratch buffer, at least a portion of the second branch of the PBs; and resume execution of the first branch of the PBs, using the intermediate output stored in the second scratch buffer.
  7. The system of claim 1, wherein the memory subsystem further comprises: a plurality of scratch buffers; and wherein the order of the sequential execution of the PBs is determined in view of a plurality of values, each of the plurality of values representing a combined minimum size of the first buffer, the second buffer, and each one of the plurality of scratch buffers supporting a respective candidate order of a plurality of candidate orders of the sequential execution of the PBs, and wherein each of the plurality of candidate orders assigns an output of the branching node and an output of each of the intermediate nodes of the PBs to one of: the first buffer, the second buffer, or one of the plurality of scratch buffers.
  8. The system of claim 7, wherein the plurality of scratch buffers comprises a number of scratch buffers that is at least a number of branches in the PBs that have at least one intermediate node.
  9. A system comprising: a memory subsystem comprising: an input buffer to store a plurality of input values into a layer of a neural network; a scratch buffer; and an output buffer; and a processing device communicatively coupled to the memory subsystem, the processing device to: compute, using the plurality of input values, a first set of output values of the layer; store the first set of output values in the scratch buffer; store a rescaled first set of output values in the output buffer; compute, using the plurality of input values, a second set of output values of the layer; overwrite the first set of output values in the scratch buffer with the second set of output values; store a rescaled second set of output values; and determine, using the rescaled first set of output values and the rescaled second set of output values, an output of the neural network.
  10. The system of claim 9, wherein the first set of output values is rescaled using a first scaling factor and the second set of output values is rescaled using a second scaling factor that is different from the first scaling factor.
  11. The system of claim 10, wherein the first scaling factor is determined in view of a size of at least some output values of the first set of output values.
  12. The system of claim 10, wherein the first scaling factor is determined in view of a size of the scratch buffer and a size of the output buffer.
  13. The system of claim 10, wherein at least some output values of the first set of output values are rescaled using different scaling factors.
  14. The system of claim 9, wherein the processing device is further to compute M sets of output values including the first set of output values and the second set of output values, and wherein the processing device is to perform pipelined processing of the M sets of output values and determine one or more scaling factors for each of the M sets of output values.
  15. The system of claim 14, wherein the processing device further comprises a scaling factor buffer to: store each of the one or more scaling factors for each of the M sets of output values.
  16. The system of claim 9, wherein each of the first set of output values is in a first integer number format and each of the rescaled first set of output values is in a second integer number format.
  17. A method to generate an execution package for a neural network (NN), the method comprising: accessing an architecture of the NN, the NN comprising a plurality of branches (PBs), each of the PBs connecting a branching node and an aggregation node, wherein at least one of the PBs comprises one or more intermediate nodes; evaluating a plurality of candidate orders of sequential execution of the PBs, wherein each candidate order of sequential execution uses a first buffer to store an output of one or more odd-numbered nodes of a first branch of the PBs and a second buffer to store an output of one or more even-numbered nodes of the first branch of the PBs; selecting, from the plurality of candidate orders, a preferred order of sequential execution of the PBs in view of a combined minimum size of the first buffer and the second buffer supporting the sequential processing of the PBs; and generating the execution package for the NN comprising the selected preferred order of sequential execution of the PBs.
  18. The method of claim 17, wherein each candidate order of sequential execution uses the first buffer or the second buffer to store an output of one or more odd-numbered nodes of a second branch of the PBs.
  19. The method of claim 17, wherein each candidate order of sequential execution uses: the first buffer to store an output of the branching node; and a first scratch buffer to store a copy of the output of the branching node.
  20. The method of claim 19, wherein each candidate order of sequential execution uses a first scratch buffer to: load an input into a first intermediate node of a second branch of the PBs; and uses a second scratch buffer to: store an output of a last intermediate node of the first branch of the PBs, and load at least a portion of an input into the aggregation node.
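
The buffer assignment and order selection recited in claims 1, 7, and 17 above can be pictured as a small search over candidate execution orders. The sketch below is a rough illustration under simplifying assumptions (the memory model and every name in it are hypothetical, not taken from the claims): within a branch, odd-numbered node outputs go to a first buffer and even-numbered outputs to a second buffer, the branching-node output and each earlier branch's last output are parked in scratch buffers until aggregation, and the candidate order with the smallest combined footprint is selected.

```python
# Rough sketch (hypothetical model, not the claimed method): estimate the combined
# buffer footprint of each candidate order of sequential branch execution and
# pick the order that needs the least memory.
from itertools import permutations

def combined_footprint(order, branch_sizes, branching_out):
    """branch_sizes[i]: output sizes (bytes) of branch i's intermediate nodes, in order."""
    buf_a = branching_out          # first buffer initially holds the branching-node output
    buf_b = 0                      # second buffer
    scratch = branching_out        # scratch copy of the branching-node output
    for pos, b in enumerate(order):
        sizes = branch_sizes[b]
        # odd-numbered node outputs use the first buffer, even-numbered the second
        buf_a = max([buf_a] + sizes[0::2])
        buf_b = max([buf_b] + sizes[1::2])
        if sizes and pos < len(order) - 1:
            # an earlier branch's last output waits in a scratch buffer until aggregation
            scratch += sizes[-1]
    return buf_a + buf_b + scratch

def preferred_order(branch_sizes, branching_out):
    candidates = permutations(range(len(branch_sizes)))
    return min(candidates, key=lambda o: combined_footprint(o, branch_sizes, branching_out))

# Example: a skip-connection branch ([]) and a branch with two intermediate nodes.
print(preferred_order([[], [256, 128]], branching_out=256))
```

For the small branch counts typical of residual blocks, exhaustively scoring every permutation is cheap; the abstract's alternative criterion of keeping the footprint below a target threshold would simply stop the search at the first acceptable order.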

Description

RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 63/231,158, filed Aug. 9, 2021, the entire contents of which are incorporated herein by reference.

TECHNICAL FIELD

The instant disclosure pertains to optimization of memory resources used to support execution of machine learning models; more specifically, to minimizing the size of memory used for accumulation of neural node outputs and for supporting multiple computational paths in residual neural networks.

BACKGROUND

Edge computing is a type of distributed computing in a cloud-based or server-based computing environment in which at least a portion of data processing occurs closer to the periphery of the environment, where collection or consumption of data takes place. An edge device can be a computing device of relatively modest processing and memory capabilities and can have access to local data (e.g., via connected sensory devices or an Internet-of-Things (IoT) network) and to a cloud service. Instead of uploading local data as input to the cloud service and then receiving a processing output from the cloud service, the edge device can in some instances process the local data using its own processor and memory resources. Even though the cloud service may be capable of processing the local data faster than the edge device, limitations of the network bandwidth can negate cloud processing gains. Local processing can have additional advantages, such as responding in real time to changing conditions, reducing the computational load of the cloud service, decreasing network traffic, eliminating exposure of sensitive data to adversarial attacks, and so on.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of an example architecture of a computing environment that supports memory-optimized deployment of one or more machine learning models, in accordance with some implementations of the present disclosure.

FIGS. 2A-D illustrate example topologies of portions of a neural network that include two or more parallel branches, in accordance with some implementations of the present disclosure. FIG. 2A illustrates two branches, one being a skip-connection branch that connects a branching node directly with an aggregation node and the other having two intermediate nodes. FIG. 2B illustrates two branches that both have intermediate nodes, e.g., a branch with two intermediate nodes and a branch with three intermediate nodes. FIG. 2C illustrates three parallel branches, with more intermediate nodes added to the topology of FIG. 2B. FIG. 2D illustrates a topology that includes an intermediate aggregation node.

FIGS. 3A-F are schematic depictions of various candidate orders of execution of an example portion of a neural network with parallel branches, in accordance with some implementations of the present disclosure. FIG. 3A illustrates an example topology of parallel branches. FIG. 3B illustrates one possible order of sequential execution of the parallel branches depicted in FIG. 3A, in accordance with some implementations of the present disclosure. FIG. 3C illustrates an alternative order of sequential execution of the parallel branches of FIG. 3A, in accordance with some implementations of the present disclosure. FIG. 3D illustrates one possible order of sequential execution of the parallel branches with temporary branch interruption, in accordance with some implementations of the present disclosure. FIG. 3E illustrates an example topology of a portion of a neural network with three parallel branches. FIG. 3F illustrates one possible order of sequential execution of the three parallel branches of FIG. 3E, in accordance with some implementations of the present disclosure.

FIG. 4 illustrates neural processing with accumulation of output values for optimization of the size of memory buffers that support neural network operations, in accordance with some implementations of the present disclosure.

FIG. 5 is a flow diagram of an example method of deploying one or more neural networks for memory-optimized execution of parallel branches of neural connections, in accordance with some implementations of the present disclosure.

FIG. 6 is a flow diagram of an example method of computation and accumulation of output values for optimization of the size of memory buffers that support neural network operations, in accordance with some implementations of the present disclosure.

FIG. 7 depicts a block diagram of an example computer system operating in accordance with some implementations of the present disclosure.

DETAILED DESCRIPTION

Modern networks may connect computing devices of very diverse processing capabilities. For example, a technological (e.g., manufacturing) line may include hundreds (or more) of wireless sensors connected to a local area network (LAN) and/or a personal area network (PAN). Groups of sensors may be served by a local (edge) processing device, such as a microcontroller unit (MC