US-20260127434-A1 - ADAPTIVE NODE REMOVAL DURING TRAINING OF AN ARTIFICIAL NEURAL NETWORK
Abstract
Systems, methods, and computer program products for adaptive node removal during training of an artificial neural network are described herein. A method comprises normalizing each weight of an artificial neural network; determining an entropy value for each node in the artificial neural network based on the weights of that node's connections; identifying a set of candidate nodes, the candidate nodes having the highest entropy value of their respective layers; selecting a subset of the set of candidate nodes; and removing, from the artificial neural network, each candidate node of the subset.
Inventors
- Kyong Min Yeo
- Malgorzata Jadwiga Zimon
- Fausto Martelli
- Bruce Gordon Elmegreen
Assignees
- INTERNATIONAL BUSINESS MACHINES CORPORATION
Dates
- Publication Date
- 2026-05-07
- Application Date
- 2024-11-04
Claims (20)
- 1. A method for adaptive node removal during training of an artificial neural network, the method comprising: normalizing each weight of an artificial neural network according to its neighborhood, the artificial neural network comprising a plurality of nodes, the nodes being organized into a plurality of layers, each of the plurality of nodes being connected to at least one node of at least one adjacent layer, wherein each connection has an associated weight; determining an entropy value for each node in the artificial neural network based on the weights of that node's connections; identifying a set of candidate nodes, wherein the set of candidate nodes comprises a node of each layer having the highest entropy value of that layer; selecting a subset of the set of candidate nodes, wherein each candidate node of the subset is connected to at least two other candidate nodes of the set; and removing, from the artificial neural network, each candidate node of the subset.
- 2. The method of claim 1, further comprising: identifying one or more nodes lacking incoming connections and/or outgoing connections according to the removal of the subset; and removing the one or more nodes from the artificial neural network.
- 3. The method of claim 1, wherein the artificial neural network is a feed-forward neural network.
- 4. The method of claim 1, wherein the plurality of layers comprises an input layer, an output layer, and at least one hidden layer, each node of the at least one hidden layer has at most four connections, each node of the input layer has one connection, and each node of the output layer has at most as many connections as the number of nodes in a preceding layer.
- 5. The method of claim 4, wherein each node of the at least one hidden layer has connections to at most two incoming nodes and to at most two outgoing nodes.
- 6. The method of claim 1, wherein the normalization of each weight comprises applying min-max normalization to each weight.
- 7. The method of claim 1, wherein the set of candidate nodes further comprises nodes having an entropy value of at least a threshold entropy value.
- 8. The method of claim 7, further comprising determining a current accuracy of the neural network, wherein the threshold entropy value is determined based on the current accuracy of the artificial neural network.
- 9. A method of training an artificial neural network, the method comprising: training the artificial neural network for a plurality of epochs, wherein during at least one of the epochs, adaptive node removal is performed according to the method of claim 1.
- 10. A computer program product comprising: one or more computer-readable storage media; and program instructions stored on the one or more computer-readable storage media to perform operations comprising: normalizing each weight of an artificial neural network according to its neighborhood, the artificial neural network comprising a plurality of nodes, the nodes being organized into a plurality of layers, each of the plurality of nodes being connected to at least one node of at least one adjacent layer, wherein each connection has an associated weight; determining an entropy value for each node in the artificial neural network based on the weights of that node's connections; identifying a set of candidate nodes, wherein the set of candidate nodes comprises a node of each layer having the highest entropy value of that layer; selecting a subset of the set of candidate nodes, wherein each candidate node of the subset is connected to at least two other candidate nodes of the set; and removing from the artificial neural network each candidate node of the subset.
- 11. The computer program product of claim 10, wherein the operations further comprise: identifying one or more orphan nodes lacking incoming connections or outgoing connections according to the removal of the subset; and removing the one or more orphan nodes from the artificial neural network.
- 12. The computer program product of claim 10, wherein the plurality of layers comprises an input layer, an output layer, and at least one hidden layer, each node of the at least one hidden layer has at most four connections, each node of the input layer has one connection, and each node of the output layer has at most as many connections as the number of nodes in a preceding layer.
- 13. The computer program product of claim 12, wherein each node of the at least one hidden layer has connections to at most two incoming nodes and to at most two outgoing nodes.
- 14. The computer program product of claim 10, wherein the normalization of each weight comprises applying min-max normalization to each weight.
- 15. The computer program product of claim 10, wherein the operations are performed during at least one epoch of training the artificial neural network, wherein the artificial neural network is trained for a plurality of epochs.
- 16. A computer system comprising: a processor set; one or more computer-readable storage media; and program instructions stored on the one or more computer-readable storage media to perform operations comprising: normalizing each weight of an artificial neural network according to its neighborhood, the artificial neural network comprising a plurality of nodes, the nodes being organized into a plurality of layers, each of the plurality of nodes being connected to at least one node of at least one adjacent layer, wherein each connection has an associated weight; determining an entropy value for each node in the artificial neural network based on the weights of that node's connections; identifying a set of candidate nodes, wherein the set of candidate nodes comprises a node of each layer having the highest entropy value of that layer; selecting a subset of the set of candidate nodes, wherein each candidate node of the subset is connected to at least two other candidate nodes of the set; and removing from the artificial neural network each candidate node of the subset.
- 17. The computer system of claim 16, wherein the operations further comprise: identifying one or more orphan nodes lacking incoming connections or outgoing connections according to the removal of the subset; and removing the one or more orphan nodes from the artificial neural network.
- 18. The computer system of claim 16, wherein the plurality of layers comprises an input layer, an output layer, and at least one hidden layer, each node of the at least one hidden layer has at most four connections, each node of the input layer has at most two connections, and each node of the output layer has at most as many connections as the number of nodes in a preceding layer.
- 19. The computer system of claim 18, wherein each node of the at least one hidden layer has connections to at most two incoming nodes and to at most two outgoing nodes.
- 20. The computer system of claim 16, wherein the operations are performed during at least one epoch of training the artificial neural network, wherein the artificial neural network is trained for a plurality of epochs.
Description
BACKGROUND

Embodiments of the present disclosure relate to training artificial neural networks, and more specifically, to adaptive node removal during training of artificial neural networks.

BRIEF SUMMARY

According to embodiments of the present disclosure, systems, methods, and computer program products for adaptive node removal during training of an artificial neural network are provided. A method for adaptive node removal during training of an artificial neural network may comprise normalizing each weight of an artificial neural network according to its neighborhood. The artificial neural network may comprise a plurality of nodes. The nodes may be organized into a plurality of layers. Each of the plurality of nodes may be connected to at least one node of at least one adjacent layer. Each connection may have an associated weight. The method may comprise determining an entropy value for each node in the artificial neural network based on the weights of that node's connections. The method may comprise identifying a set of candidate nodes. The set of candidate nodes may comprise a node of each layer having the highest entropy value of that layer. The method may comprise selecting a subset of the set of candidate nodes. Each candidate node of the subset may be connected to at least two other candidate nodes of the set. The method may comprise removing, from the artificial neural network, each candidate node of the subset.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow diagram depicting an exemplary method for adaptive node removal during training of an artificial neural network, in accordance with one or more embodiments of this disclosure. FIGS. 2A, 2B, and 2C depict an exemplary artificial neural network at different stages of training, in accordance with one or more embodiments of this disclosure. FIG. 3 depicts a computing node according to an embodiment of the present disclosure.
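The steps of the summarized method can be sketched in Python as follows. This is an illustrative assumption of one possible implementation, not the claimed embodiment itself: the entropy formula is not fixed by the claims, so treating a node's min-max-normalized connection weights as an unnormalized probability distribution is a choice made here for demonstration, and the simple `layers`/`edges` data structures are likewise hypothetical.

```python
import math

def min_max_normalize(weights):
    """Min-max normalize a node's connection weights into [0, 1] (claims 6, 14)."""
    lo, hi = min(weights), max(weights)
    if hi == lo:  # all weights equal: map to a flat 0.5
        return [0.5] * len(weights)
    return [(w - lo) / (hi - lo) for w in weights]

def node_entropy(weights):
    """Shannon entropy over a node's normalized connection weights.

    Assumption for illustration: the normalized weights are treated as an
    unnormalized probability distribution; the claims do not fix a formula.
    """
    norm = min_max_normalize(weights)
    total = sum(norm)
    if total == 0:
        return 0.0
    probs = [w / total for w in norm if w > 0]
    return -sum(p * math.log(p) for p in probs)

def highest_entropy_candidates(layers):
    """layers: list of {node_id: [connection weights]} dicts.

    Return the highest-entropy node of each layer (the candidate set)."""
    return {max(layer, key=lambda n: node_entropy(layer[n])) for layer in layers}

def removable_subset(candidates, edges):
    """Keep only candidates connected to at least two other candidates.

    edges is a set of (source, target) node-id pairs; the subset rule
    follows claim 1's selection criterion."""
    def candidate_links(n):
        return sum(1 for u, v in edges
                   if (u == n and v in candidates) or (v == n and u in candidates))
    return {n for n in candidates if candidate_links(n) >= 2}
```

In use, `highest_entropy_candidates` would be called once per layer sweep and `removable_subset` applied to its result together with the network's current connection set; the surviving subset is then deleted from the network before training resumes.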
DETAILED DESCRIPTION

An artificial neural network is a collection of one or more nodes. An artificial neural network is often divided into groups of nodes called layers. A layer is a collection of one or more nodes that all receive input from the same layer(s) and all send output to the same layer(s). The layer(s) from which the one or more nodes all receive input and the layer(s) to which the one or more nodes all send output are the adjacent, or neighboring, layers of the layer.

Two nodes between which information flows (e.g., output of one node is sent to and received as input by the other node) are connected. Each connection may have an associated weight. In some implementations, the associated weight of a connection may characterize the strength of the connection. A layer from which the one or more nodes all receive input is a preceding layer. A node's connection with a node of its preceding layer may be referred to as an incoming connection. A layer to which the one or more nodes all send output may be referred to as a following layer. A node's connection with a node of its following layer may be referred to as an outgoing connection.

An input layer is a layer that receives input from a source outside the artificial neural network. An output layer is a layer that sends output to a target outside the artificial neural network. All other layers are intermediate processing layers (i.e., hidden layers). A multilayer neural network is an artificial neural network with more than one layer. A deep neural network is a multilayer neural network with many layers. A sparse neural network is an artificial neural network where a node is only connected to some (one or more) but not all nodes of its adjacent layer(s). In sparse neural networks, the information flow from the input layer to the output layer may be isolated and form paths.

A tensor is a multidimensional array of numerical values. A tensor block is a contiguous subarray of the elements in a tensor.
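In a sparse network of the kind described above, removing a node can leave neighbors with no incoming or no outgoing connections. The following Python sketch illustrates the orphan-node cleanup of claims 2, 11, and 17 under stated assumptions: edges are hypothetical (source, target) node-id pairs, input-layer nodes are exempt from the incoming-connection check, and output-layer nodes from the outgoing-connection check. It is one plausible reading, not the patented embodiment.

```python
def prune_orphans(nodes, edges, input_nodes, output_nodes):
    """Iteratively remove nodes left without incoming or outgoing connections.

    nodes: iterable of node ids; edges: set of (source, target) pairs.
    input_nodes / output_nodes: ids exempt from the incoming / outgoing
    check, respectively. Removal cascades: dropping one orphan may strand
    another, so the loop repeats until the network is stable.
    """
    nodes, edges = set(nodes), set(edges)
    changed = True
    while changed:
        changed = False
        for n in list(nodes):
            has_in = any(v == n for _, v in edges)
            has_out = any(u == n for u, _ in edges)
            if (n not in input_nodes and not has_in) or \
               (n not in output_nodes and not has_out):
                nodes.discard(n)
                edges = {(u, v) for u, v in edges if n not in (u, v)}
                changed = True
    return nodes, edges
```

For example, in a four-node network with edges `(0, 1)` and `(1, 3)`, a hidden node `2` with no connections is pruned while the input-to-output path through nodes `0`, `1`, and `3` survives.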
Each neural network layer is associated with a parameter tensor V, weight tensor W, input data tensor X, output data tensor Y, and intermediate data tensor Z. The parameter tensor contains all parameters that control node activation functions σ in the layer. The weight tensor contains all weights that connect inputs to the layer. The input data tensor contains all data that the layer consumes as input. The output data tensor contains all data that the layer computes as output. The intermediate data tensor contains any data that the layer produces as intermediate computations, such as partial sums. The data tensors (input, output, and intermediate) for a layer may be 3-dimensional, where the first two dimensions may be interpreted as encoding spatial location and the third dimension as encoding different features. For example, when a data tensor represents a color image, the first two dimensions encode vertical and horizontal coordinates within the image, and the third dimension encodes the color at each location. Every element of the input data tensor X can be connected to e