US-12619866-B2 - Methods and apparatus to facilitate continuous learning


Abstract

Methods, apparatus, systems and articles of manufacture are disclosed to facilitate continuous learning. An example apparatus includes a trainer to train a first Bayesian neural network (BNN) and a second BNN, the first BNN associated with a first weight distribution and the second BNN associated with a second weight distribution. The example apparatus includes a weight determiner to determine a first sampling weight associated with the first BNN and a second sampling weight associated with the second BNN. The example apparatus includes a network sampler to sample at least one of the first weight distribution or the second weight distribution based on a pseudo-random number, the first sampling weight, and the second sampling weight. The example apparatus includes an inference controller to generate an ensemble weight distribution based on the sample.
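The sampling scheme summarized in the abstract can be sketched in code. The following is a minimal illustration only, not the patented implementation: each BNN is reduced to a single scalar weight distribution, and all function and variable names are hypothetical.

```python
import random

def sample_ensemble_weight(dist1, dist2, w1, w2, rng=random):
    """Draw one weight from a two-component Gaussian mixture.

    dist1, dist2: (mean, std) of the first and second BNN's weight
    distributions. w1, w2: sampling weights (assumed to sum to 1,
    per claim 4). A pseudo-random number selects which component's
    Gaussian to sample, in proportion to the sampling weights.
    """
    u = rng.random()  # pseudo-random number in [0, 1)
    mean, std = dist1 if u < w1 / (w1 + w2) else dist2
    return rng.gauss(mean, std)

# Repeated draws approximate the Gaussian mixture model ensemble
# weight distribution formed from the two BNNs.
samples = [sample_ensemble_weight((0.0, 1.0), (3.0, 0.5), 0.5, 0.5)
           for _ in range(10000)]
```

In a real BNN every weight in the network has its own distribution, so this draw would be repeated per weight; the scalar case is only meant to show the role of the pseudo-random number and the sampling weights.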

Inventors

  • Nilesh Ahuja
  • Mahesh Subedar
  • Ranganath Krishnan
  • Ibrahima Ndiour
  • Omesh Tickoo

Assignees

  • INTEL CORPORATION

Dates

Publication Date: 2026-05-05
Application Date: 2020-12-23

Claims (19)

  1. An apparatus, comprising: interface circuitry; machine readable instructions; and programmable circuitry to at least one of instantiate or execute the machine readable instructions to: train a first Bayesian neural network (BNN) using initial training data, the first BNN associated with a first weight distribution; in response to obtaining additional training data, train a second BNN using the initial training data and the additional training data, the second BNN associated with a second weight distribution, wherein the first weight distribution and the second weight distribution are unimodal multivariate Gaussian distributions having respective means and standard deviations; determine a first sampling weight associated with the first BNN and a second sampling weight associated with the second BNN; sample a Gaussian distribution of at least one of the first weight distribution or the second weight distribution based on a pseudo-random number, the first sampling weight, and the second sampling weight; generate a Gaussian mixture model ensemble weight distribution based on the sample, the Gaussian mixture model ensemble weight distribution being part of an ensemble BNN system that includes the first BNN and the second BNN, the ensemble BNN system being less susceptible to catastrophic forgetting than the second BNN; and cause training of a third BNN based on the Gaussian mixture model ensemble weight distribution.
  2. The apparatus of claim 1, wherein the programmable circuitry is to train the first BNN and the second BNN on a training dataset.
  3. The apparatus of claim 1, wherein the programmable circuitry is to train the first BNN on a first subset of a training dataset and the second BNN on a second subset of the training dataset.
  4. The apparatus of claim 1, wherein the first sampling weight and the second sampling weight sum to 1.
  5. The apparatus of claim 1, wherein the instructions cause the programmable circuitry to determine the first sampling weight and the second sampling weight based on a proportion of samples of a class and a number of networks of the class.
  6. The apparatus of claim 1, wherein the programmable circuitry is to determine an uncertainty of the Gaussian mixture model ensemble weight distribution, the uncertainty including an aleatoric uncertainty and an epistemic uncertainty.
  7. The apparatus of claim 6, wherein the programmable circuitry is to identify out-of-distribution data based on the epistemic uncertainty.
  8. The apparatus of claim 1, wherein the programmable circuitry is to: continue, in response to a determination that an uncertainty of the Gaussian mixture model ensemble weight distribution is less than a threshold uncertainty, to sample the at least one of the first weight distribution or the second weight distribution; and generate, in response to a determination that the uncertainty of the Gaussian mixture model ensemble weight distribution is greater than or equal to the threshold uncertainty, a predictive distribution based on the samples.
  9. At least one non-transitory computer readable medium comprising instructions that, when executed, cause at least one processor to at least: train a first Bayesian neural network (BNN) using initial training data, the first BNN associated with a first weight distribution; in response to obtaining additional training data, train a second BNN using the initial training data and the additional training data, the second BNN associated with a second weight distribution, wherein the first weight distribution and the second weight distribution are unimodal multivariate Gaussian distributions having respective means and standard deviations; determine a first sampling weight associated with the first BNN and a second sampling weight associated with the second BNN; sample a Gaussian distribution of at least one of the first weight distribution or the second weight distribution based on a pseudo-random number, the first sampling weight, and the second sampling weight; generate a Gaussian mixture model ensemble weight distribution based on the sample, the Gaussian mixture model ensemble weight distribution being part of an ensemble BNN system that includes the first BNN and the second BNN, the ensemble BNN system being less susceptible to catastrophic forgetting than the second BNN; and cause training of a third BNN based on the Gaussian mixture model ensemble weight distribution.
  10. The at least one non-transitory computer readable medium of claim 9, wherein the instructions, when executed, cause the at least one processor to train the first BNN and the second BNN on a training dataset.
  11. The at least one non-transitory computer readable medium of claim 9, wherein the instructions, when executed, cause the at least one processor to train the first BNN on a first subset of a training dataset and the second BNN on a second subset of the training dataset.
  12. The at least one non-transitory computer readable medium of claim 9, wherein the first sampling weight and the second sampling weight sum to 1.
  13. The at least one non-transitory computer readable medium of claim 9, wherein the instructions, when executed, cause the at least one processor to determine the first sampling weight and the second sampling weight based on a proportion of samples of a class and a number of networks of the class.
  14. The at least one non-transitory computer readable medium of claim 9, wherein the instructions, when executed, cause the at least one processor to determine an uncertainty of the Gaussian mixture model ensemble weight distribution, the uncertainty including an aleatoric uncertainty and an epistemic uncertainty.
  15. The at least one non-transitory computer readable medium of claim 14, wherein the instructions, when executed, cause the at least one processor to identify out-of-distribution data based on the epistemic uncertainty.
  16. The at least one non-transitory computer readable medium of claim 9, wherein the instructions, when executed, cause the at least one processor to: continue, in response to a determination that an uncertainty of the Gaussian mixture model ensemble weight distribution is less than a threshold uncertainty, to sample the at least one of the first weight distribution or the second weight distribution; and generate, in response to a determination that the uncertainty of the Gaussian mixture model ensemble weight distribution is greater than or equal to the threshold uncertainty, a predictive distribution based on the samples.
  17. An apparatus, comprising: memory; and at least one processor to execute machine readable instructions to: train a first Bayesian neural network (BNN) using initial training data, the first BNN associated with a first weight distribution; in response to obtaining additional training data, train a second BNN using the initial training data and the additional training data, the second BNN associated with a second weight distribution, wherein the first weight distribution and the second weight distribution are unimodal multivariate Gaussian distributions having respective means and standard deviations; determine a first sampling weight associated with the first BNN and a second sampling weight associated with the second BNN; sample at least one of the first weight distribution or the second weight distribution based on a pseudo-random number, the first sampling weight, and the second sampling weight; generate a Gaussian mixture model ensemble weight distribution based on the sample, the Gaussian mixture model ensemble weight distribution being part of an ensemble BNN system that includes the first BNN and the second BNN, the ensemble BNN system being less susceptible to catastrophic forgetting than the second BNN; and cause training of a third BNN based on the Gaussian mixture model ensemble weight distribution.
  18. The apparatus of claim 17, wherein the at least one processor is to train the first BNN and the second BNN on a training dataset.
  19. The apparatus of claim 17, wherein the at least one processor is to: continue, in response to a determination that an uncertainty of the Gaussian mixture model ensemble weight distribution is less than a threshold uncertainty, sampling the at least one of the first weight distribution or the second weight distribution; and generate, in response to a determination that the uncertainty of the Gaussian mixture model ensemble weight distribution is greater than or equal to the threshold uncertainty, a predictive distribution based on the samples.
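Claims 6-8 describe decomposing the ensemble's predictive uncertainty into aleatoric and epistemic components. One common way to realize such a decomposition (a sketch under the assumption that aleatoric uncertainty is the mean of the per-sample predictive variances and epistemic uncertainty is the variance of the per-sample predictive means; the claims do not fix a particular formula, and all names here are hypothetical) is:

```python
import statistics

def decompose_uncertainty(pred_means, pred_vars):
    """Split predictive uncertainty into aleatoric and epistemic parts.

    pred_means, pred_vars: predictive means and variances for one
    input, each entry obtained with a different set of weights drawn
    from the ensemble weight distribution.
    """
    aleatoric = statistics.fmean(pred_vars)       # noise inherent in the data
    epistemic = statistics.pvariance(pred_means)  # disagreement across draws
    return aleatoric, epistemic

# Four hypothetical draws from the ensemble for a single input.
means = [0.9, 1.1, 1.0, 0.95]
vars_ = [0.2, 0.25, 0.22, 0.21]
alea, epis = decompose_uncertainty(means, vars_)
total = alea + epis
```

Per claim 7, a high epistemic term can flag out-of-distribution inputs, and per claim 8, sampling can continue until the total uncertainty estimate crosses a threshold, at which point a predictive distribution is formed from the accumulated samples.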

Description

FIELD OF THE DISCLOSURE

This disclosure relates generally to neural networks and, more particularly, to methods and apparatus to facilitate efficient knowledge sharing among neural networks.

BACKGROUND

In recent years, machine learning and/or artificial intelligence have increased in popularity. For example, machine learning and/or artificial intelligence may be implemented using neural networks. Neural networks are computing systems inspired by the neural networks of human brains. A neural network can receive an input and generate an output. The neural network can be trained (e.g., can learn) based on feedback so that the output corresponds to a desired result. Once trained, the neural network can make decisions to generate an output based on any input. Neural networks are used for the emerging fields of artificial intelligence and/or machine learning. A Bayesian neural network is a particular type of neural network that includes neurons that output a variable weight as opposed to a fixed weight. The variable weight falls within a probability distribution defined by a mean value and a variance determined during training of the Bayesian neural network.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic illustration of an example Bayesian neural network.
FIG. 2 is a schematic illustration of an example environment including an example first BNN system, an example second BNN system, and an example BNN ensemble controller to facilitate continuous learning in accordance with teachings of this disclosure.
FIG. 3 is an example illustration of a two-dimensional weight space.
FIG. 4 is a schematic illustration of an example BNN ensemble system for continuous learning.
FIG. 5 is an example illustration of weighted parameters.
FIG. 6 is a flowchart representative of example machine readable instructions which may be executed to implement the example BNN ensemble controller of FIG. 2.
FIG. 7 is a block diagram of an example processing platform structured to execute the instructions of FIG. 6 to implement the example BNN ensemble controller of FIG. 2.
FIG. 8 is a block diagram of an example software distribution platform to distribute software (e.g., software corresponding to the example computer readable instructions of FIG. 6) to client devices.

The figures are not to scale. In general, the same reference numbers will be used throughout the drawing(s) and accompanying written description to refer to the same or like parts. Unless specifically stated otherwise, descriptors such as “first,” “second,” “third,” etc. are used herein without imputing or otherwise indicating any meaning of priority, physical order, arrangement in a list, and/or ordering in any way, but are merely used as labels and/or arbitrary names to distinguish elements for ease of understanding the disclosed examples. In some examples, the descriptor “first” may be used to refer to an element in the detailed description, while the same element may be referred to in a claim with a different descriptor such as “second” or “third.” In such instances, it should be understood that such descriptors are used merely for identifying those elements distinctly that might, for example, otherwise share a same name. As used herein, “substantially real time” refers to occurrence in a near instantaneous manner, recognizing there may be real-world delays for computing time, transmission, etc. Thus, unless otherwise specified, “substantially real time” refers to real time +/−1 second.

DETAILED DESCRIPTION

Ideally, deep neural networks (DNNs) deployed in real-world tasks should be able to recognize atypical inputs (e.g., inputs that would be considered out-of-distribution, anomalous, novel, etc.) to determine whether to ignore the inputs (e.g., because they are not relevant to the task) or learn from them. That is, DNNs should perform out-of-distribution (OOD) detection and continuous learning on new inputs. However, prior DNNs are not well suited for OOD detection and continuous learning. For example, when trying to identify OOD inputs, DNNs tend to give incorrect yet overconfident outcomes. Further, when trying to learn from new inputs by updating their weights, DNNs rapidly forget their old data. That is, DNNs experience catastrophic forgetting when learning from new inputs. In some examples, DNNs are not suited for OOD detection and continuous learning because the weights and parameters of DNNs are represented by single point estimates. Therefore, a single set of trained network weights does not capture the model uncertainty (e.g., the epistemic uncertainty) due to the lack of complete knowledge of the network's weights. Further, any deviation from this single set of weights results in network performance degradation on the previous training data (e.g., leading to catastrophic forgetting in continuous learning scenarios). Thus, a set of trained weights associated with a probability distribution can be marginalized during inference and better represent the complete knowledge of the