US-20260127488-A1 - TRAINING METHOD FOR MACHINE LEARNING MODEL AND HOST SYSTEM

US 20260127488 A1

Abstract

A training method for a machine learning model and a host system are provided. The host system includes a rewritable non-volatile memory module. The training method includes: executing a training process of the machine learning model, which includes, in an iteration at an epoch of the training process, storing transient data and backtracking data generated by the iteration in the rewritable non-volatile memory module; and in response to an abnormality occurring in the host system which causes an interruption in the iteration, reading the transient data and the backtracking data from the rewritable non-volatile memory module, determining a stage of the iteration based on the backtracking data, and resuming the stage according to the transient data.

Inventors

  • Yu-Hao Wang
  • Szu-Wei Chen
  • Jian Ping Syu
  • Hao-Zhi Lee
  • An-Cheng Liu

Assignees

  • PHISON ELECTRONICS CORP.

Dates

Publication Date
May 7, 2026
Application Date
Dec. 4, 2024
Priority Date
Nov. 5, 2024

Claims (18)

  1. A training method for a machine learning model, adapted for a host system that comprises a rewritable non-volatile memory module, the training method comprising: executing a training process of the machine learning model, which comprises, in an iteration at an epoch of the training process, storing transient data and backtracking data generated by the iteration in the rewritable non-volatile memory module; and in response to an abnormality occurring in the host system which causes an interruption in the iteration, reading the transient data and the backtracking data from the rewritable non-volatile memory module, determining a stage of the iteration based on the backtracking data, and resuming the stage according to the transient data.
  2. The training method according to claim 1, wherein the iteration comprises forward propagation, backward propagation, and an update stage, wherein storing the transient data and the backtracking data generated by the iteration in the rewritable non-volatile memory module comprises: in response to the forward propagation being completed, obtaining an output of a neuron in the machine learning model, setting the transient data to include the output of the neuron, setting the backtracking data to indicate that the forward propagation has been completed, and writing the output of the neuron and the backtracking data to the rewritable non-volatile memory module.
  3. The training method according to claim 2, wherein determining the stage of the iteration based on the backtracking data, and resuming the stage according to the transient data comprises: setting the stage to the backward propagation according to the backtracking data; and reading the output of the neuron and a plurality of weights from the rewritable non-volatile memory module, and re-executing the backward propagation according to the output of the neuron and the weights.
  4. The training method according to claim 1, wherein the iteration comprises forward propagation, backward propagation, and an update stage, wherein storing the transient data and the backtracking data generated by the iteration in the rewritable non-volatile memory module comprises: in response to the backward propagation being completed, setting the transient data to include a gradient, setting the backtracking data to indicate that the backward propagation has been completed, and writing the gradient and the backtracking data to the rewritable non-volatile memory module.
  5. The training method according to claim 4, wherein determining the stage of the iteration based on the backtracking data, and resuming the stage according to the transient data comprises: setting the stage to the update stage according to the backtracking data; and reading the gradient and a plurality of weights from the rewritable non-volatile memory module, and re-executing the update stage according to the gradient and the weights.
  6. The training method according to claim 4, wherein the machine learning model comprises a plurality of layers, and storing the transient data and the backtracking data generated by the iteration in the rewritable non-volatile memory module further comprises: after updating a first layer of the layers in the update stage, setting the transient data to further include a plurality of updated weights of the first layer, setting the backtracking data to indicate that the first layer has been updated, and writing the updated weights and the backtracking data to the rewritable non-volatile memory module.
  7. The training method according to claim 6, wherein determining the stage of the iteration based on the backtracking data, and resuming the stage according to the transient data comprises: setting the stage to a second layer of the layers according to the backtracking data, wherein the second layer is different from the first layer; and reading the gradient and a plurality of weights of the second layer from the rewritable non-volatile memory module, and updating the weights of the second layer according to the gradient.
  8. The training method according to claim 1, wherein the iteration comprises forward propagation, backward propagation, and an update stage, wherein storing the transient data and the backtracking data generated by the iteration in the rewritable non-volatile memory module comprises: in response to the update stage being completed, setting the transient data to include a plurality of updated weights and an updated optimization parameter, setting the backtracking data to indicate that the update stage has been completed, and writing the updated weights, the updated optimization parameter, and the backtracking data to the rewritable non-volatile memory module.
  9. The training method according to claim 8, further comprising: in response to forward propagation of a subsequent iteration being interrupted, reading the updated weights and the updated optimization parameter from the rewritable non-volatile memory module; and re-executing the subsequent iteration according to the updated weights and the updated optimization parameter, wherein the subsequent iteration is executed after the iteration.
  10. A host system, comprising: a rewritable non-volatile memory module; and a processor electrically connected to the rewritable non-volatile memory module for: executing a training process of a machine learning model, which comprises, in an iteration at an epoch of the training process, storing transient data and backtracking data generated by the iteration in the rewritable non-volatile memory module; and in response to an abnormality occurring in the host system which causes an interruption in the iteration, reading the transient data and the backtracking data from the rewritable non-volatile memory module, determining a stage of the iteration based on the backtracking data, and resuming the stage according to the transient data.
  11. The host system according to claim 10, wherein the iteration comprises forward propagation, backward propagation, and an update stage, wherein storing the transient data and the backtracking data generated by the iteration in the rewritable non-volatile memory module comprises: in response to the forward propagation being completed, obtaining an output of a neuron in the machine learning model, setting the transient data to include the output of the neuron, setting the backtracking data to indicate that the forward propagation has been completed, and writing the output of the neuron and the backtracking data to the rewritable non-volatile memory module.
  12. The host system according to claim 11, wherein determining the stage of the iteration based on the backtracking data, and resuming the stage according to the transient data comprises: setting the stage to the backward propagation according to the backtracking data; and reading the output of the neuron and a plurality of weights from the rewritable non-volatile memory module, and re-executing the backward propagation according to the output of the neuron and the weights.
  13. The host system according to claim 10, wherein the iteration comprises forward propagation, backward propagation, and an update stage, wherein storing the transient data and the backtracking data generated by the iteration in the rewritable non-volatile memory module comprises: in response to the backward propagation being completed, setting the transient data to include a gradient, setting the backtracking data to indicate that the backward propagation has been completed, and writing the gradient and the backtracking data to the rewritable non-volatile memory module.
  14. The host system according to claim 13, wherein determining the stage of the iteration based on the backtracking data, and resuming the stage according to the transient data comprises: setting the stage to the update stage according to the backtracking data; and reading the gradient and a plurality of weights from the rewritable non-volatile memory module, and re-executing the update stage according to the gradient and the weights.
  15. The host system according to claim 13, wherein the machine learning model comprises a plurality of layers, and storing the transient data and the backtracking data generated by the iteration in the rewritable non-volatile memory module further comprises: after updating a first layer of the layers in the update stage, setting the transient data to further include a plurality of updated weights of the first layer, setting the backtracking data to indicate that the first layer has been updated, and writing the updated weights and the backtracking data to the rewritable non-volatile memory module.
  16. The host system according to claim 15, wherein determining the stage of the iteration based on the backtracking data, and resuming the stage according to the transient data comprises: setting the stage to a second layer of the layers according to the backtracking data, wherein the second layer is different from the first layer; and reading the gradient and a plurality of weights of the second layer from the rewritable non-volatile memory module, and updating the weights of the second layer according to the gradient.
  17. The host system according to claim 10, wherein the iteration comprises forward propagation, backward propagation, and an update stage, wherein storing the transient data and the backtracking data generated by the iteration in the rewritable non-volatile memory module comprises: in response to the update stage being completed, setting the transient data to include a plurality of updated weights and an updated optimization parameter, setting the backtracking data to indicate that the update stage has been completed, and writing the updated weights, the updated optimization parameter, and the backtracking data to the rewritable non-volatile memory module.
  18. The host system according to claim 17, wherein the processor further: in response to forward propagation of a subsequent iteration being interrupted, reads the updated weights and the updated optimization parameter from the rewritable non-volatile memory module; and re-executes the subsequent iteration according to the updated weights and the updated optimization parameter, wherein the subsequent iteration is executed after the iteration.
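The staged checkpoint-and-resume flow described in the claims can be illustrated with a minimal sketch. All names, the dict-based `nvm` stand-in for the rewritable non-volatile memory module, and the one-neuron linear model are illustrative assumptions for exposition, not the patented implementation:

```python
# Hypothetical stand-in for the rewritable non-volatile memory module:
# a dict whose writes are assumed to survive an interruption.
nvm = {}

def train_iteration(w, x, y, lr=0.01):
    """One iteration that checkpoints after each stage (cf. claims 2, 4, 8).

    A one-neuron linear model (out = x * w, squared-error loss) keeps the
    sketch self-contained; a real model would store per-layer tensors.
    """
    # Forward propagation: save the neuron output and current weight.
    out = x * w
    nvm["transient"] = {"neuron_output": out, "weights": w}
    nvm["backtracking"] = "forward_done"

    # Backward propagation: save the gradient.
    grad = x * (out - y)
    nvm["transient"]["gradient"] = grad
    nvm["backtracking"] = "backward_done"

    # Update stage: save the updated weight (the claims also store an
    # updated optimization parameter, e.g. a momentum term).
    updated = w - lr * grad
    nvm["transient"]["updated_weights"] = updated
    nvm["backtracking"] = "update_done"
    return updated

def resume(x, y, lr=0.01):
    """Determine the interrupted stage from the backtracking data and
    re-execute only from that stage onward (cf. claims 1, 3, 5)."""
    stage = nvm["backtracking"]
    t = nvm["transient"]
    if stage == "forward_done":    # redo backward propagation and update
        grad = x * (t["neuron_output"] - y)
        return t["weights"] - lr * grad
    if stage == "backward_done":   # redo only the update stage
        return t["weights"] - lr * t["gradient"]
    return t["updated_weights"]    # update_done: iteration finished
```

Because the backtracking data records which stage completed last, a resume re-executes only the interrupted stage rather than backtracking to an epoch-level checkpoint, which is the efficiency gain the disclosure targets.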

Description

CROSS-REFERENCE TO RELATED APPLICATION

This application claims the priority benefit of Taiwan application serial no. 113142309, filed on Nov. 5, 2024. The entirety of the above-mentioned patent application is hereby incorporated by reference herein and made a part of this specification.

BACKGROUND

Technical Field

The disclosure relates to a training method for a machine learning model and a host system using a rewritable non-volatile memory module.

Description of Related Art

As artificial intelligence technology develops rapidly, deep learning models are applied in more and more fields, especially in fields such as natural language processing, image recognition, and speech recognition. However, training these complex models involves a large amount of data, resulting in a very time-consuming training process.

Generally, the training process of deep learning models is divided into multiple epochs, with each epoch representing a complete traversal of the training dataset. During the training process, to reduce the impact of an unexpected interruption on the training progress, a checkpoint is usually set at the end of each epoch. If the system experiences an interruption or failure, the model may recover from the last checkpoint and re-execute the current epoch, thereby eliminating the need to start the training from the beginning.

However, as the scale of datasets increases, the time required for each epoch also increases significantly. Even with the checkpoint mechanism, backtracking to the checkpoint and re-executing the epoch after an interruption still costs a considerable amount of time and computational resources. This problem is particularly prominent in the training of large datasets, especially when the model needs to iterate multiple times to achieve the desired accuracy. As a result, the loss in efficiency becomes more severe.

SUMMARY

An embodiment of the disclosure provides a training method for a machine learning model, which is adapted for a host system.
The host system includes a rewritable non-volatile memory module. The training method includes: executing a training process of the machine learning model, which includes, in an iteration at an epoch of the training process, storing transient data and backtracking data generated by the iteration in the rewritable non-volatile memory module; and in response to an abnormality occurring in the host system which causes an interruption in the iteration, reading the transient data and the backtracking data from the rewritable non-volatile memory module, determining a stage of the iteration based on the backtracking data, and resuming the stage according to the transient data.

In an embodiment of the disclosure, the iteration includes forward propagation, backward propagation, and an update stage. Storing the transient data and the backtracking data generated by the iteration in the rewritable non-volatile memory module includes: in response to the forward propagation being completed, obtaining an output of a neuron in the machine learning model, setting the transient data to include the output of the neuron, setting the backtracking data to indicate that the forward propagation has been completed, and writing the output of the neuron and the backtracking data to the rewritable non-volatile memory module.

In an embodiment of the disclosure, determining the stage of the iteration based on the backtracking data, and resuming the stage according to the transient data includes: setting the stage to the backward propagation according to the backtracking data; and reading the output of the neuron and a plurality of weights from the rewritable non-volatile memory module, and re-executing the backward propagation according to the output of the neuron and the weights.
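The update stage itself can also be checkpointed per layer, as in claims 6 and 7: after each layer's weights are updated, they are written together with a backtracking marker recording the last updated layer, so a resume continues from the next layer instead of redoing the whole update stage. A minimal sketch, where the `nvm` dict and all function names are illustrative assumptions rather than the patented implementation:

```python
# Hypothetical per-layer checkpointing of the update stage (cf. claims 6-7).
# `nvm` stands in for the rewritable non-volatile memory module.
nvm = {"backtracking": None, "transient": {}}

def update_stage(layers, grads, lr=0.01, start=0):
    """Update layers[start:]; `layers` is a list of per-layer weight lists.

    After each layer is updated, its new weights and a backtracking marker
    naming that layer are written to NVM.
    """
    for i in range(start, len(layers)):
        layers[i] = [w - lr * g for w, g in zip(layers[i], grads[i])]
        nvm["transient"][i] = layers[i]            # updated weights of layer i
        nvm["backtracking"] = ("layer_updated", i)
    return layers

def resume_update(layers, grads, lr=0.01):
    """Restore the already-updated layers from NVM, then continue the
    update stage from the first layer that was not yet updated."""
    _, last = nvm["backtracking"]
    for i in range(last + 1):
        layers[i] = nvm["transient"][i]
    return update_stage(layers, grads, lr=lr, start=last + 1)
```

The backtracking marker here names a layer rather than a stage, so an interruption midway through the update stage costs only the remaining layers' updates.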
In an embodiment of the disclosure, storing the transient data and the backtracking data generated by the iteration in the rewritable non-volatile memory module includes: in response to the backward propagation being completed, setting the transient data to include a gradient, setting the backtracking data to indicate that the backward propagation has been completed, and writing the gradient and the backtracking data to the rewritable non-volatile memory module.

In an embodiment of the disclosure, determining the stage of the iteration based on the backtracking data, and resuming the stage according to the transient data includes: setting the stage to the update stage according to the backtracking data; and reading the gradient and a plurality of weights from the rewritable non-volatile memory module, and re-executing the update stage according to the gradient and the weights.

In an embodiment of the disclosure, the machine learning model includes a plurality of layers, and storing the transient data and the backtracking data generated by the iteration in the rewritable non-volatile memory module further includes: after updating a first layer of the layers in the update stage, setting the transient