US-12626140-B2 - Systems and methods for online time series forecasting
Abstract
Embodiments provide a framework combining fast and slow learning networks (referred to as “FSNet”) to train deep neural forecasters on the fly for online time-series forecasting. FSNet is built on a deep neural network backbone (slow learner) with two complementary components to facilitate fast adaptation to both new and recurrent concepts. To this end, FSNet employs a per-layer adapter to monitor each layer's contribution to the forecasting loss via its partial derivative. The adapter transforms each layer's weight and feature at each step based on its recent gradient, allowing fine-grained per-layer fast adaptation to optimize the current loss. In addition, FSNet employs a second, complementary associative memory component to store important, recurring patterns observed during training. The adapter interacts with the memory to store, update, and retrieve the previous transformations, facilitating fast learning of such patterns.
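The per-layer fast adaptation described above can be sketched in a few lines. This is a minimal illustration, not the patent's exact parameterization: the smoothing factor `gamma`, the linear map `W_map` standing in for the adapter's learned mapping, and the dense matmul standing in for a dilated convolution are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def ema_gradient(g_hat, grad, gamma=0.9):
    """Exponential moving average of a layer's recent gradients (illustrative gamma)."""
    return gamma * g_hat + (1.0 - gamma) * grad

def adapter(g_hat, W_map):
    """Map portions of the smoothed gradient to adaptation parameters.

    The first half of the output scales the layer's weights; the second half
    scales its output features. W_map is a hypothetical stand-in for the
    adapter's learned gradient-to-parameter mapping.
    """
    u = W_map @ g_hat
    d = u.shape[0] // 2
    return u[:d], u[d:]  # (weight adaptation, feature adaptation)

def adapted_layer(x, W, alpha, beta):
    """Apply per-output-channel weight adaptation, then feature adaptation.

    A dense matmul with ReLU stands in for the backbone's dilated convolution.
    """
    W_tilde = (1.0 + alpha)[:, None] * W   # adapt weights around identity
    h = np.maximum(W_tilde @ x, 0.0)       # layer forward pass
    return (1.0 + beta) * h                # adapt the output feature map

# One adaptation step for a toy layer with 4 output channels and 3 inputs
W = rng.standard_normal((4, 3))
g_hat = ema_gradient(np.zeros(12), rng.standard_normal(12))
alpha, beta = adapter(g_hat, rng.standard_normal((8, 12)))
y = adapted_layer(rng.standard_normal(3), W, alpha, beta)
```

The key design point the sketch reflects is that adaptation is driven by the layer's own gradient statistics rather than by extra input features, which is what lets each layer react to concept drift independently.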
Inventors
- Hong-Quang Pham
- Chenghao LIU
- Doyen Sahoo
- Chu Hong Hoi
Assignees
- SALESFORCE, INC.
Dates
- Publication Date
- 20260512
- Application Date
- 20220722
Claims (20)
- 1 . A method of forecasting time series data at future timestamps in a dynamic system, the method comprising: receiving, via a data interface, a time series dataset that includes a plurality of datapoints corresponding to a plurality of timestamps within a lookback time window; computing, at a first convolutional layer from a stack of convolutional layers, a first gradient based on an exponential moving average of gradients corresponding to the first convolutional layer; determining first adaptation parameters corresponding to the first convolutional layer based on mapping portions of the first gradient to elements of the first adaptation parameters, wherein the first adaptation parameters comprise a first weight adaptation component and a first feature adaptation component; computing an adapted feature map based at least in part on the first adaptation parameters and a previous adapted feature map from a preceding convolutional layer; generating, via a regressor, time series forecast data corresponding to a future time window based on a final feature map output from the stack of convolutional layers corresponding to the time series data within the lookback time window; for at least one convolutional layer in a temporal convolutional neural network: determining, based on the plurality of datapoints, a layer forecasting loss indicative of a loss contribution of the respective convolutional layer to an overall forecasting loss according to the plurality of datapoints, and updating the at least one convolutional layer based on the layer forecasting loss; computing a cosine similarity between the first gradient of the updated convolutional layer and a longer-term gradient associated with the at least one convolutional layer; in response to determining that the cosine similarity is greater than a predefined threshold: retrieving, from an indexed memory corresponding to the first convolutional layer, a current adaptation parameter, updating content stored at the indexed memory based on the current adaptation parameter and the first adaptation parameter, and updating the first adaptation parameters by taking a weighted average with the retrieved current adaptation parameter; computing a forecast loss based on the generated time series forecast data and ground-truth data corresponding to the future time window; and updating the stack of convolutional layers based on the forecast loss via backpropagation.
- 2 . The method of claim 1 , further comprising: computing an adapted layer parameter based on generating a first adapted weight based on the first weight adaptation component and a layer parameter corresponding to the first layer; and generating a feature map of the first convolutional layer with the first feature adaptation component.
- 3 . The method of claim 2 , wherein the adapted feature map is computed based on the first feature adaptation component and a first feature map of the first convolutional layer, and wherein the first feature map is a convolution of the adapted layer parameter and a previous adapted feature map from a preceding layer.
- 4 . The method of claim 3 , wherein the stack of convolutional layers and the regressor are updated by: updating the regressor via stochastic gradient descent; and updating, at the first convolutional layer, the first gradient and the first adaptation parameter.
- 5 . The method of claim 1 , further comprising: in response to determining that the cosine similarity is greater than a predefined threshold: triggering a memory read or write operation that captures a current pattern of gradients.
- 6 . The method of claim 5 , wherein the current pattern is captured by: computing attentions based on a current content of the memory and a current adaptation parameter; selecting a set of top relevant attentions from the computed attentions; and updating the current adaptation parameter by taking a weighted sum of the current content of the memory weighted by the set of top relevant attentions.
- 7 . The method of claim 6 , further comprising: performing a write operation to update and accumulate the current content of the memory based on the updated current adaptation parameter.
- 8 . A system for forecasting time series data at future timestamps in a dynamic system, the system comprising: a data interface that receives a time series dataset that includes a plurality of datapoints corresponding to a plurality of timestamps within a lookback time window; a memory that stores a plurality of processor-executable instructions; and a processor that reads from the memory and executes the instructions to perform operations comprising: computing, at a first convolutional layer from a stack of convolutional layers, a first gradient based on an exponential moving average of gradients corresponding to the first convolutional layer; determining first adaptation parameters corresponding to the first convolutional layer based on mapping portions of the first gradient to elements of the first adaptation parameters, wherein the first adaptation parameters comprise a first weight adaptation component and a first feature adaptation component; computing an adapted feature map based at least in part on the first adaptation parameters and a previous adapted feature map from a preceding convolutional layer; generating, via a regressor, time series forecast data corresponding to a future time window based on a final feature map output from the stack of convolutional layers corresponding to the time series data within the lookback time window; for at least one convolutional layer in a temporal convolutional neural network: determining, based on the plurality of datapoints, a layer forecasting loss indicative of a loss contribution of the respective convolutional layer to an overall forecasting loss according to the plurality of datapoints, and updating the at least one convolutional layer based on the layer forecasting loss; computing a cosine similarity between the first gradient of the updated convolutional layer and a longer-term gradient associated with the at least one convolutional layer; in response to determining that the cosine similarity is greater than a predefined threshold: retrieving, from an indexed memory corresponding to the first convolutional layer, a current adaptation parameter, updating content stored at the indexed memory based on the current adaptation parameter and the first adaptation parameter, and updating the first adaptation parameters by taking a weighted average with the retrieved current adaptation parameter; computing a forecast loss based on the generated time series forecast data and ground-truth data corresponding to the future time window; and updating the stack of convolutional layers based on the forecast loss via backpropagation.
- 9 . The system of claim 8 , wherein the operations further comprise: computing an adapted layer parameter based on generating a first adapted weight based on the first weight adaptation component and a layer parameter corresponding to the first layer; and generating a feature map of the first convolutional layer with the first feature adaptation component.
- 10 . The system of claim 9 , wherein the adapted feature map is computed based on the first feature adaptation component and a first feature map of the first convolutional layer, and wherein the first feature map is a convolution of the adapted layer parameter and a previous adapted feature map from a preceding layer.
- 11 . The system of claim 10 , wherein the stack of convolutional layers and the regressor are updated by: updating the regressor via stochastic gradient descent; and updating, at the first convolutional layer, the first gradient and the first adaptation parameter.
- 12 . The system of claim 8 , wherein the operations further comprise: in response to determining that the cosine similarity is greater than a predefined threshold: triggering a memory read or write operation that captures a current pattern of gradients.
- 13 . The system of claim 12 , wherein the current pattern is captured by: computing attentions based on a current content of the memory and a current adaptation parameter; selecting a set of top relevant attentions from the computed attentions; and updating the current adaptation parameter by taking a weighted sum of the current content of the memory weighted by the set of top relevant attentions.
- 14 . A non-transitory processor-readable storage medium storing processor-readable instructions for forecasting time series data at future timestamps in a dynamic system, the instructions being executed by a processor to perform operations comprising: receiving, via a data interface, a time series dataset that includes a plurality of datapoints corresponding to a plurality of timestamps within a lookback time window; computing, at a first convolutional layer from a stack of convolutional layers, a first gradient based on an exponential moving average of gradients corresponding to the first convolutional layer; determining first adaptation parameters corresponding to the first convolutional layer based on mapping portions of the first gradient to elements of the first adaptation parameters, wherein the first adaptation parameters comprise a first weight adaptation component and a first feature adaptation component; computing an adapted feature map based at least in part on the first adaptation parameters and a previous adapted feature map from a preceding convolutional layer; generating, via a regressor, time series forecast data corresponding to a future time window based on a final feature map output from the stack of convolutional layers corresponding to the time series data within the lookback time window; for at least one convolutional layer in a temporal convolutional neural network: determining, based on the plurality of datapoints, a layer forecasting loss indicative of a loss contribution of the respective convolutional layer to an overall forecasting loss according to the plurality of datapoints, and updating the at least one convolutional layer based on the layer forecasting loss; computing a cosine similarity between the first gradient of the updated convolutional layer and a longer-term gradient associated with the at least one convolutional layer; in response to determining that the cosine similarity is greater than a predefined threshold: retrieving, from an indexed memory corresponding to the first convolutional layer, a current adaptation parameter, updating content stored at the indexed memory based on the current adaptation parameter and the first adaptation parameter, and updating the first adaptation parameters by taking a weighted average with the retrieved current adaptation parameter; computing a forecast loss based on the generated time series forecast data and ground-truth data corresponding to the future time window; and updating the stack of convolutional layers based on the forecast loss via backpropagation.
- 15 . The non-transitory processor-readable storage medium of claim 14 , further comprising: computing an adapted layer parameter based on generating a first adapted weight based on the first weight adaptation component and a layer parameter corresponding to the first layer; and generating a feature map of the first convolutional layer with the first feature adaptation component.
- 16 . The non-transitory processor-readable storage medium of claim 15 , wherein the adapted feature map is computed based on the first feature adaptation component and a first feature map of the first convolutional layer, and wherein the first feature map is a convolution of the adapted layer parameter and a previous adapted feature map from a preceding layer.
- 17 . The non-transitory processor-readable storage medium of claim 16 , wherein the stack of convolutional layers and the regressor are updated by: updating the regressor via stochastic gradient descent; and updating, at the first convolutional layer, the first gradient and the first adaptation parameter.
- 18 . The non-transitory processor-readable storage medium of claim 14 , further comprising: in response to determining that the cosine similarity is greater than a predefined threshold: triggering a memory read or write operation that captures a current pattern of gradients.
- 19 . The non-transitory processor-readable storage medium of claim 18 , wherein the current pattern is captured by: computing attentions based on a current content of the memory and a current adaptation parameter; selecting a set of top relevant attentions from the computed attentions; and updating the current adaptation parameter by taking a weighted sum of the current content of the memory weighted by the set of top relevant attentions.
- 20 . The non-transitory processor-readable storage medium of claim 19 , further comprising: performing a write operation to update and accumulate the current content of the memory based on the updated current adaptation parameter.
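The cosine-similarity trigger and the attention-based memory read/write recited in the claims can be sketched as follows. This is an illustrative toy implementation under stated assumptions: the threshold `tau`, top-k value `k`, decay factor, and the 50/50 weighted average are hypothetical choices, and flat vectors stand in for per-layer adaptation parameters.

```python
import numpy as np

def should_interact(g_recent, g_long, tau=0.7):
    """Trigger the memory interaction when the cosine similarity between the
    recent and longer-term gradient EMAs exceeds a threshold (per claim 1)."""
    cos = g_recent @ g_long / (np.linalg.norm(g_recent) * np.linalg.norm(g_long) + 1e-12)
    return cos > tau

def read_memory(M, u, k=2):
    """Attention between the current adaptation parameter u and the memory
    slots (rows of M); keep only the top-k most relevant attentions."""
    scores = M @ u / (np.linalg.norm(M, axis=1) * np.linalg.norm(u) + 1e-12)
    top = np.argsort(scores)[-k:]
    w = np.exp(scores[top])
    w = w / w.sum()                  # softmax over the retained scores
    return w @ M[top], top, w        # weighted sum of memory content

def write_memory(M, u_new, top, w, decay=0.7):
    """Accumulate the updated adaptation parameter back into the attended slots."""
    M = M.copy()
    for i, wi in zip(top, w):
        M[i] = decay * M[i] + (1.0 - decay) * wi * u_new
    return M

# Toy memory of 3 slots of dimension 3; u is most similar to slot 0
M = np.eye(3)
u = np.array([1.0, 0.0, 0.0])
u_mem, top, w = read_memory(M, u, k=2)
u_updated = 0.5 * u + 0.5 * u_mem    # weighted average with retrieved parameter
M_new = write_memory(M, u_updated, top, w)
```

Because the read is a sparse top-k attention, only the slots most similar to the current adaptation pattern are retrieved and updated, which is what lets the memory preserve distinct recurring patterns instead of blurring them together.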
Description
CROSS REFERENCE(S) The instant application is a nonprovisional of and claims priority under 35 U.S.C. 119 to U.S. provisional application No. 63/305,145, filed Jan. 31, 2022, which is hereby expressly incorporated by reference herein in its entirety. TECHNICAL FIELD The embodiments relate generally to machine learning systems, and more specifically to online time series forecasting. BACKGROUND Deep neural network models have been widely used in time series forecasting. For example, learning models may be used to forecast time series data such as continuous market data over a period of time in the future, weather data, and/or the like. Existing deep models adopt batch-learning for time series forecasting tasks. Such models often randomly sample look-back and forecast windows during training and freeze the model during evaluation, breaking the time-varying (non-stationary) nature of time series. Therefore, there is a need for an efficient and adaptive deep learning framework for online time series forecasting. BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a simplified diagram illustrating an example structure of the FSNet framework for forecasting a time series, according to embodiments described herein. FIG. 2 is a simplified diagram illustrating an example structure of a TCN layer (block) of the FSNet framework described in FIG. 1, according to embodiments described herein. FIG. 3 is a simplified diagram illustrating an example structure of the dilated convolution layer in the TCN layer (block) described in FIG. 2, according to embodiments described herein. FIG. 4 is a simplified diagram of a computing device that implements the FSNet framework, according to some embodiments described herein. FIG. 5 is a simplified pseudo code segment for a fast and slow learning network implemented at the FSNet framework described in FIGS. 1-3, according to embodiments described herein. FIG. 
6 is a simplified logic flow diagram illustrating an example process corresponding to the pseudo code algorithm in FIG. 5, according to embodiments described herein. FIGS. 7-9 are example data charts and plots illustrating performance of the FSNet in example data experiments, according to embodiments described herein. In the figures, elements having the same designations have the same or similar functions. DETAILED DESCRIPTION As used herein, the term “network” may comprise any hardware or software-based framework that includes any artificial intelligence network or system, neural network or system and/or any training or learning models implemented thereon or therewith. As used herein, the term “module” may comprise a hardware or software-based framework that performs one or more functions. In some embodiments, the module may be implemented on one or more neural networks. A time series is a set of values that correspond to a parameter of interest at different points in time. Examples of the parameter can include prices of stocks, temperature measurements, and the like. Time series forecasting is the process of determining a future datapoint or a set of future datapoints beyond the set of values in the time series. Time series forecasting of dynamic data via deep learning remains challenging. Embodiments provide a framework combining fast and slow learning networks (referred to as “FSNet”) to train deep neural forecasters on the fly for online time-series forecasting. FSNet is built on a deep neural network backbone (slow learner) with two complementary components to facilitate fast adaptation to both new and recurrent concepts. To this end, FSNet employs a per-layer adapter to monitor each layer's contribution to the forecasting loss via its partial derivative. The adapter transforms each layer's weight and feature at each step based on its recent gradient, allowing fine-grained per-layer fast adaptation to optimize the current loss. 
In addition, FSNet employs a second, complementary associative memory component to store important, recurring patterns observed during training. The adapter interacts with the memory to store, update, and retrieve the previous transformations, facilitating fast learning of such patterns. In this way, the FSNet framework can adapt to both fast-changing and long-recurring patterns in time series. Specifically, in FSNet, the deep neural network plays the role of the neocortex while the adapter and its memory act as the hippocampus component. FSNet Framework Overview FIG. 1 is a simplified diagram illustrating an example structure of the FSNet framework 100 for forecasting a time series, according to embodiments described herein. The FSNet framework 100 comprises a plurality of convolution blocks 104a-n connected to a regressor 105. The FSNet framework 100 may receive time series data 102, denoted by χ=(x1, . . . , xT)∈ℝ^(T×n), a time series of T observations each having n dimensions, from an input interface such as a memory or a network adapter. In some embodiments, the time series data 102 may be data in a look back win