CN-115345296-B - Training method and device for multitasking model

Abstract

Embodiments of this specification provide a training method and device for a multi-task model, wherein the multi-task model comprises a backbone network for determining a user characterization and k head networks for performing k user prediction tasks based on the user characterization. The method comprises: determining, based on m user samples, k groups of original gradient vectors of the k user prediction tasks with respect to the backbone network; mapping the k groups of original gradient vectors to a subspace of the original space in which they reside to obtain k groups of mapped gradient vectors; determining r corresponding weights based on the component distribution of the k groups of mapped gradient vectors over r spatial dimensions of the subspace; weighting the r dimensional components of each mapped gradient vector with the r weights to obtain k groups of weighted gradient vectors; and mapping the k groups of weighted gradient vectors back to the original space to obtain k groups of processed gradient vectors for updating network parameters of the backbone network.

Inventors

DONG XIN
WU RUIZE
LI HAI
XIONG CHAO
CHENG LEI
HE YONG
ZHANG LIANG
MO LINJIAN

Assignees

Alipay (Hangzhou) Information Technology Co., Ltd. (支付宝(杭州)信息技术有限公司)

Dates

Publication Date: 20260508
Application Date: 20220812

Claims (20)

1. A training method for a multi-task model, the multi-task model comprising a backbone network and k head networks, the backbone network being used to determine a user characterization, the k head networks being used to perform k user prediction tasks based on the user characterization, the k user prediction tasks including predicting whether a user will perform k preset actions on a target advertisement, the k preset actions including clicking and converting, the method comprising: determining, based on m user samples, k groups of original gradient vectors of the k user prediction tasks with respect to the backbone network, wherein each user sample comprises user features, k user labels corresponding to the k user prediction tasks, and advertisement features of the target advertisement, the advertisement features comprising an advertisement form, and the advertisement form comprising video, picture, or text; mapping the k groups of original gradient vectors to a subspace of the original space in which they reside to obtain k groups of mapped gradient vectors; determining r corresponding weights based on the component distribution of the k groups of mapped gradient vectors over r spatial dimensions of the subspace, and weighting the r dimensional components of each mapped gradient vector with the r weights to obtain k groups of weighted gradient vectors; mapping the k groups of weighted gradient vectors back to the original space to obtain k groups of processed gradient vectors; and updating network parameters of the backbone network using the k groups of processed gradient vectors.
2. The method of claim 1, wherein determining, based on m user samples, k groups of original gradient vectors of the k user prediction tasks with respect to the backbone network comprises: determining, based on each user sample, k original gradient vectors of the k user prediction tasks with respect to the backbone network; and determining the k groups of original gradient vectors based on the k original gradient vectors.
3. The method of claim 2, wherein determining, based on each user sample, k original gradient vectors of the k user prediction tasks with respect to the backbone network comprises: inputting the user features in each user sample into the multi-task model to obtain k prediction results; determining k training losses corresponding to the k user prediction tasks based on the k prediction results and the corresponding k user labels; and determining the k original gradient vectors based on the k training losses and network parameters of the k head networks.
4. The method of claim 2, wherein determining the k groups of original gradient vectors based on the k original gradient vectors comprises: assigning the k original gradient vectors to the corresponding k groups of original gradient vectors, wherein each resulting group of original gradient vectors comprises m original gradient vectors corresponding to the m user samples.
5. The method of claim 2, wherein determining the k groups of original gradient vectors based on the k original gradient vectors comprises: for each user prediction task, calculating an average gradient vector from the m original gradient vectors that are determined for the m user samples and correspond to the task, and taking the average gradient vector as the group of original gradient vectors corresponding to the task.
6. The method of claim 1, wherein mapping the k groups of original gradient vectors to a subspace of the original space in which they reside to obtain k groups of mapped gradient vectors comprises: generating a set of orthogonal basis vectors based on the k groups of original gradient vectors to form the subspace; and determining coordinate values of the k groups of original gradient vectors on the set of orthogonal basis vectors to form the k groups of mapped gradient vectors.
7. The method of claim 6, wherein generating a set of orthogonal basis vectors based on the k groups of original gradient vectors comprises: performing singular value decomposition (SVD) on a matrix formed by the k groups of original gradient vectors, and taking a plurality of non-zero row vectors contained in the right singular matrix obtained from the decomposition as the set of orthogonal basis vectors.
8. The method of claim 1, wherein determining r corresponding weights based on the component distribution of the k groups of mapped gradient vectors over r spatial dimensions of the subspace comprises: for each spatial dimension, determining the weight corresponding to the spatial dimension based on the numbers of positive and negative signs among the plurality of components of the k groups of mapped gradient vectors in that dimension.
9. The method of claim 8, wherein determining the weight corresponding to the spatial dimension based on the numbers of positive and negative signs among the plurality of components of the k groups of mapped gradient vectors in that dimension comprises: setting the weight to 1 when the number of positive signs is 0 or the number of negative signs is 0, and to 0 otherwise; or calculating the absolute difference between the number of positive signs and the number of negative signs, and taking the ratio of the absolute difference to the number of the plurality of components as the weight.
10. The method of claim 1, wherein updating network parameters of the backbone network using the k groups of processed gradient vectors comprises: calculating an average gradient vector based on the k groups of processed gradient vectors; and updating the current network parameters of the backbone network to the difference obtained by subtracting, from the current network parameters, the product of a preset learning rate and the average gradient vector.
11. The method of claim 1, wherein the k user prediction tasks include user click rate prediction and user conversion rate prediction.
12. A training method for a multi-task model, the multi-task model comprising a backbone network for determining a user characterization and k head networks for performing k user prediction tasks based on the user characterization, the k user prediction tasks including emotional tendency prediction and user intent prediction for a user session, the method comprising: determining, based on m user samples, k groups of original gradient vectors of the k user prediction tasks with respect to the backbone network, wherein each user sample comprises user features and k user labels corresponding to the k user prediction tasks, the user features comprise user session content, and the k user labels comprise emotion category labels and user intent labels; mapping the k groups of original gradient vectors to a subspace of the original space in which they reside to obtain k groups of mapped gradient vectors; determining r corresponding weights based on the component distribution of the k groups of mapped gradient vectors over r spatial dimensions of the subspace, and weighting the r dimensional components of each mapped gradient vector with the r weights to obtain k groups of weighted gradient vectors; mapping the k groups of weighted gradient vectors back to the original space to obtain k groups of processed gradient vectors; and updating network parameters of the backbone network using the k groups of processed gradient vectors.
13. A training method for a multi-task model, the multi-task model comprising a backbone network for determining an event characterization of an event and k head networks for performing k event prediction tasks based on the event characterization, the k event prediction tasks comprising risk prediction and intervention mode prediction, the method comprising: determining, based on m event samples, k groups of original gradient vectors of the k event prediction tasks with respect to the backbone network, wherein each event sample comprises event features and k event labels corresponding to the k event prediction tasks, the event features comprise at least one of occurrence time, occurrence place, and terminal device model, and the k event labels comprise risk labels and intervention mode labels; mapping the k groups of original gradient vectors to a subspace of the original space in which they reside to obtain k groups of mapped gradient vectors; determining r corresponding weights based on the component distribution of the k groups of mapped gradient vectors over r spatial dimensions of the subspace, and weighting the r dimensional components of each mapped gradient vector with the r weights to obtain k groups of weighted gradient vectors; mapping the k groups of weighted gradient vectors back to the original space to obtain k groups of processed gradient vectors; and updating network parameters of the backbone network using the k groups of processed gradient vectors.
14. A training method for a multi-task model, the multi-task model comprising a backbone network and k head networks, the backbone network being used to determine a commodity characterization of a commodity, the k head networks being used to perform k commodity prediction tasks based on the commodity characterization, the k commodity prediction tasks comprising target crowd prediction and sales prediction, the method comprising: determining, based on m commodity samples, k groups of original gradient vectors of the k commodity prediction tasks with respect to the backbone network, wherein each commodity sample comprises commodity features and k commodity labels corresponding to the k commodity prediction tasks, the commodity features comprise at least one of category, place of production, shelf life, cost, and manufacturer, and the k commodity labels comprise target crowd labels and sales volume labels; mapping the k groups of original gradient vectors to a subspace of the original space in which they reside to obtain k groups of mapped gradient vectors; determining r corresponding weights based on the component distribution of the k groups of mapped gradient vectors over r spatial dimensions of the subspace, and weighting the r dimensional components of each mapped gradient vector with the r weights to obtain k groups of weighted gradient vectors; mapping the k groups of weighted gradient vectors back to the original space to obtain k groups of processed gradient vectors; and updating network parameters of the backbone network using the k groups of processed gradient vectors.
15. A training method for a multi-task model, the multi-task model comprising a backbone network and k head networks, the backbone network being used to determine a user characterization, the k head networks being used to perform k user prediction tasks based on the user characterization, the k user prediction tasks including predicting whether a user will perform k preset actions on a target commodity, the k preset actions including at least two of browsing commodity information, adding to a shopping cart, sharing, favoriting, placing an order, and paying, the method comprising: determining, based on m user samples, k groups of original gradient vectors of the k user prediction tasks with respect to the backbone network, wherein each user sample comprises user features, k user labels corresponding to the k user prediction tasks, and commodity features of the target commodity, the user features comprising behavior features of the user with respect to historical commodities; mapping the k groups of original gradient vectors to a subspace of the original space in which they reside to obtain k groups of mapped gradient vectors; determining r corresponding weights based on the component distribution of the k groups of mapped gradient vectors over r spatial dimensions of the subspace, and weighting the r dimensional components of each mapped gradient vector with the r weights to obtain k groups of weighted gradient vectors; mapping the k groups of weighted gradient vectors back to the original space to obtain k groups of processed gradient vectors; and updating network parameters of the backbone network using the k groups of processed gradient vectors.
16. A training apparatus for a multi-task model, the multi-task model comprising a backbone network for determining a user characterization and k head networks for performing k user prediction tasks based on the user characterization, the k user prediction tasks including predicting whether a user will perform k preset actions on a target advertisement, the k preset actions including clicking and converting, the apparatus comprising: an original gradient determining unit configured to determine, based on m user samples, k groups of original gradient vectors of the k user prediction tasks with respect to the backbone network, wherein each user sample comprises user features, k user labels corresponding to the k user prediction tasks, and advertisement features of the target advertisement, the advertisement features comprising an advertisement form, and the advertisement form comprising video, picture, or text; a first gradient mapping unit configured to map the k groups of original gradient vectors to a subspace of the original space in which they reside to obtain k groups of mapped gradient vectors; a dimension weight determining unit configured to determine r corresponding weights based on the component distribution of the k groups of mapped gradient vectors over r spatial dimensions of the subspace; a gradient weighting unit configured to weight the r dimensional components of each mapped gradient vector with the r weights to obtain k groups of weighted gradient vectors; a second gradient mapping unit configured to map the k groups of weighted gradient vectors back to the original space to obtain k groups of processed gradient vectors; and a parameter updating unit configured to update network parameters of the backbone network using the k groups of processed gradient vectors.
17. A training apparatus for a multi-task model, the multi-task model comprising a backbone network for determining a user characterization and k head networks for performing k user prediction tasks based on the user characterization, the k user prediction tasks including emotional tendency prediction and user intent prediction for a user session, the apparatus comprising: an original gradient determining unit configured to determine, based on m user samples, k groups of original gradient vectors of the k user prediction tasks with respect to the backbone network, wherein each user sample comprises user features and k user labels corresponding to the k user prediction tasks, the user features comprise user session content, and the k user labels comprise emotion category labels and user intent labels; a first gradient mapping unit configured to map the k groups of original gradient vectors to a subspace of the original space in which they reside to obtain k groups of mapped gradient vectors; a dimension weight determining unit configured to determine r corresponding weights based on the component distribution of the k groups of mapped gradient vectors over r spatial dimensions of the subspace; a gradient weighting unit configured to weight the r dimensional components of each mapped gradient vector with the r weights to obtain k groups of weighted gradient vectors; a second gradient mapping unit configured to map the k groups of weighted gradient vectors back to the original space to obtain k groups of processed gradient vectors; and a parameter updating unit configured to update network parameters of the backbone network using the k groups of processed gradient vectors.
18. A training apparatus for a multi-task model, the multi-task model comprising a backbone network for determining an event characterization of an event and k head networks for performing k event prediction tasks based on the event characterization, the k event prediction tasks comprising risk prediction and intervention mode prediction, the apparatus comprising: an original gradient determining unit configured to determine, based on m event samples, k groups of original gradient vectors of the k event prediction tasks with respect to the backbone network, wherein each event sample comprises event features and k event labels corresponding to the k event prediction tasks, and the event features comprise at least one of occurrence time, occurrence place, and terminal device model; a first gradient mapping unit configured to map the k groups of original gradient vectors to a subspace of the original space in which they reside to obtain k groups of mapped gradient vectors; a dimension weight determining unit configured to determine r corresponding weights based on the component distribution of the k groups of mapped gradient vectors over r spatial dimensions of the subspace; a gradient weighting unit configured to weight the r dimensional components of each mapped gradient vector with the r weights to obtain k groups of weighted gradient vectors; a second gradient mapping unit configured to map the k groups of weighted gradient vectors back to the original space to obtain k groups of processed gradient vectors; and a parameter updating unit configured to update network parameters of the backbone network using the k groups of processed gradient vectors.
19. A training apparatus for a multi-task model, the multi-task model comprising a backbone network for determining a commodity characterization of a commodity and k head networks for performing k commodity prediction tasks based on the commodity characterization, the k commodity prediction tasks comprising risk prediction and intervention mode prediction, the apparatus comprising: an original gradient determining unit configured to determine, based on m commodity samples, k groups of original gradient vectors of the k commodity prediction tasks with respect to the backbone network, wherein each commodity sample comprises commodity features and k commodity labels corresponding to the k commodity prediction tasks, the commodity features comprise at least one of category, place of production, shelf life, cost, and manufacturer, and the k commodity labels comprise target crowd labels and sales volume labels; a first gradient mapping unit configured to map the k groups of original gradient vectors to a subspace of the original space in which they reside to obtain k groups of mapped gradient vectors; a dimension weight determining unit configured to determine r corresponding weights based on the component distribution of the k groups of mapped gradient vectors over r spatial dimensions of the subspace; a gradient weighting unit configured to weight the r dimensional components of each mapped gradient vector with the r weights to obtain k groups of weighted gradient vectors; a second gradient mapping unit configured to map the k groups of weighted gradient vectors back to the original space to obtain k groups of processed gradient vectors; and a parameter updating unit configured to update network parameters of the backbone network using the k groups of processed gradient vectors.
20. A computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the method of any of claims 1-15.
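
The gradient-processing steps recited in the claims above (orthogonal basis via SVD, coordinates in the subspace, sign-based dimension weights, mapping back, and the averaged update) can be illustrated with a minimal NumPy sketch. This is an illustration under stated assumptions, not the patented implementation: the function names, the rank tolerance `tol`, the `mode` switch between the two weighting options, and the choice to stack every gradient vector from the k groups as rows of one matrix `G` are all hypothetical.

```python
import numpy as np

def orthonormal_basis(G, tol=1e-10):
    """Rows of the right singular matrix with non-zero singular values.
    G: (n, d) matrix whose rows are the original gradient vectors."""
    _, s, vt = np.linalg.svd(G, full_matrices=False)
    r = int(np.sum(s > tol))          # assumed rank cutoff, not from the source
    return vt[:r]                     # (r, d) basis of the subspace

def sign_weights(C, mode="hard"):
    """One weight per subspace dimension from the signs of the components.
    C: (n, r) coordinates of the gradient vectors in the subspace."""
    pos = np.sum(C > 0, axis=0)
    neg = np.sum(C < 0, axis=0)
    if mode == "hard":
        # 1 when all components share a sign, otherwise 0
        return ((pos == 0) | (neg == 0)).astype(float)
    # |#positive - #negative| divided by the number of components
    return np.abs(pos - neg) / C.shape[0]

def process_gradients(G, mode="hard"):
    B = orthonormal_basis(G)          # subspace basis
    C = G @ B.T                       # mapped gradient vectors
    Cw = C * sign_weights(C, mode)    # weighted gradient vectors
    return Cw @ B                     # processed vectors in the original space

def update_backbone(theta, G_processed, lr):
    """Subtract learning rate times the average processed gradient."""
    return theta - lr * G_processed.mean(axis=0)

# Usage with three toy gradient vectors in a 2-dimensional parameter space.
G = np.array([[3.0, 1.0], [3.0, 1.0], [3.0, -2.0]])
theta = update_backbone(np.zeros(2), process_gradients(G, mode="hard"), lr=0.1)
```

In this toy input the gradient vectors agree along the first direction and conflict along the second, so the hard weighting keeps only the shared direction while the ratio-based weighting merely shrinks the conflicting one.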

Description

Training method and device for multitasking model

Technical Field

One or more embodiments of this specification relate to the field of machine learning, and in particular to a training method and device for a multi-task model.

Background

With social development and technological progress, more and more service platforms provide a variety of services to meet users' needs in work and life. To deliver personalized service to each user, a service platform uses machine learning to predict multiple aspects of user behavior, for example whether a user will like, favorite, or forward a given article. Multi-task learning is a machine learning approach that learns multiple related tasks together based on a shared representation, so that training data is shared across tasks. It has therefore been proposed to construct a multi-task model that makes multi-task predictions for users. However, current multi-task prediction is of limited effectiveness and struggles to meet the higher requirements of practical applications. A scheme is therefore needed that improves multi-task prediction for users and thereby improves the user experience.

Disclosure of Invention

The embodiments of this specification describe a training method and device for a multi-task model that can avoid negative transfer among different prediction tasks and thereby improve the performance of multiple prediction tasks at the same time.
According to a first aspect, a training method for a multi-task model is provided. The multi-task model comprises a backbone network and k head networks, the backbone network being used to determine a user characterization and the k head networks being used to perform k user prediction tasks based on the user characterization. The method comprises: determining, based on m user samples, k groups of original gradient vectors of the k user prediction tasks with respect to the backbone network, wherein each user sample comprises user features and k user labels corresponding to the k user prediction tasks; mapping the k groups of original gradient vectors to a subspace of the original space in which they reside to obtain k groups of mapped gradient vectors; determining r corresponding weights based on the component distribution of the k groups of mapped gradient vectors over r spatial dimensions of the subspace; weighting the r dimensional components of each mapped gradient vector with the r weights to obtain k groups of weighted gradient vectors; mapping the k groups of weighted gradient vectors back to the original space to obtain k groups of processed gradient vectors; and updating network parameters of the backbone network using the k groups of processed gradient vectors. In one embodiment, determining, based on m user samples, k groups of original gradient vectors of the k user prediction tasks with respect to the backbone network includes: determining, based on each user sample, k original gradient vectors of the k user prediction tasks with respect to the backbone network; and determining the k groups of original gradient vectors based on the k original gradient vectors.
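
As a rough illustration of how per-sample, per-task gradients with respect to the backbone might be obtained and arranged into k groups, the sketch below uses a toy model that is purely an assumption: a linear backbone `W`, k logistic head vectors `heads`, and a binary cross-entropy loss per task. None of these choices, nor the function names, come from the specification; the gradients are written out analytically with the chain rule so the block needs only NumPy.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def per_sample_task_gradients(W, heads, x, y):
    """k flattened gradients of the k task losses w.r.t. backbone W, one sample.
    W: (d_in, d_rep), heads: (k, d_rep), x: (d_in,), y: (k,) binary labels."""
    rep = x @ W                       # characterization from the toy backbone
    p = sigmoid(heads @ rep)          # k prediction results
    dlogit = p - y                    # d(BCE)/d(logit) for each task
    # Chain rule: dL_t/dW = outer(x, dlogit_t * heads[t]), flattened row-major
    return np.stack([np.outer(x, dlogit[t] * heads[t]).ravel()
                     for t in range(len(y))])

def grouped_gradients(W, heads, X, Y, average=False):
    """Arrange gradients as k groups: m vectors per task, or one mean per task."""
    per_sample = np.stack([per_sample_task_gradients(W, heads, x, y)
                           for x, y in zip(X, Y)])      # (m, k, D)
    groups = per_sample.transpose(1, 0, 2)              # (k, m, D)
    if average:
        return groups.mean(axis=1, keepdims=True)       # (k, 1, D)
    return groups

# Usage: m = 4 samples, k = 2 tasks, a 3x2 toy backbone (D = 6 parameters).
rng = np.random.default_rng(0)
W, heads = rng.normal(size=(3, 2)), rng.normal(size=(2, 2))
X = rng.normal(size=(4, 3))
Y = np.array([[1, 0], [0, 0], [1, 1], [0, 1]], dtype=float)
groups = grouped_gradients(W, heads, X, Y)
```

In a real model the gradients would normally come from automatic differentiation rather than a hand-written formula; the closed form here only keeps the example self-contained.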
In a specific embodiment, determining, based on each user sample, k original gradient vectors of the k user prediction tasks with respect to the backbone network includes: inputting the user features in each user sample into the multi-task model to obtain k prediction results; determining k training losses corresponding to the k user prediction tasks based on the k prediction results and the corresponding k user labels; and determining the k original gradient vectors based on the k training losses and network parameters of the k head networks. In a specific embodiment, determining the k groups of original gradient vectors based on the k original gradient vectors includes assigning the k original gradient vectors to the corresponding k groups of original gradient vectors, wherein each resulting group of original gradient vectors comprises m original gradient vectors corresponding to the m user samples. In a specific embodiment, determining the k groups of original gradient vectors based on the k original gradient vectors includes, for each user prediction task, calculating an average gradient vector from the m original gradient vectors that are determined for the m user samples and correspond to the task, and taking the average gradient vector as the group of original gradient vectors corresponding to the task. In one embodiment, the k groups of original gradient vectors are mapped to a subspace of the original space in which they reside to obtain k groups of mapped gradient vectors, wherein the k groups of mapped gradient vectors are formed by generating a group of orthogonal b