CN-115358288-B - Multi-mode classification model training method and device based on label constraint


Abstract

The invention discloses a multi-modal classification model training method and device based on label constraint. The method comprises: determining training data of a target modality used for training a model and the corresponding data labels; inputting the training data into a converged data classification model to obtain training data features corresponding to the training data; inputting the data labels into a trained label classification model to obtain label features corresponding to the data labels; and inputting the training data and the data labels into the data classification model for training, and optimizing the model parameters of the data classification model according to a target loss function value in training until convergence to obtain the trained data classification model, wherein the target loss function value comprises a feature difference degree between the training data features and the label features. The method and device thereby allow the features extracted by the model to discriminate between labels, which in turn improves the prediction performance of the model.

Inventors

HUANG YUYAN
CHEN CHANGXIN

Assignees

有米科技股份有限公司

Dates

Publication Date: 20260508
Application Date: 20220719

Claims (7)

1. A method for training a multi-modal classification model based on label constraints, the method comprising: determining training data of a target modality for training a model and corresponding data labels, wherein the target modality comprises at least one of an audio modality, an image modality and a text modality, the training data comprises at least one of audio data, image data and text data, and the data labels comprise at least one of data labels corresponding to the audio data, data labels corresponding to the image data and data labels corresponding to the text data; inputting the training data into a converged data classification model to obtain training data features corresponding to the training data, wherein the data classification model is used for extracting features of data of the target modality, the data classification model comprises at least one of an audio classification model, an image classification model and a text classification model, the audio classification model comprises at least one of a Speech Transformer model and a Conformer model, the image classification model comprises at least one of a CNN model, a ViT model and a CoTNet model, and the text classification model comprises at least one of a BERT model, an XLNet model and a RoBERTa model; inputting the data labels into a trained label classification model to obtain label features corresponding to the data labels; and inputting the training data and the data labels into the data classification model for training, and optimizing model parameters of the data classification model according to a target loss function value in training until convergence to obtain the trained data classification model, wherein the target loss function value comprises a feature difference degree between the training data features and the label features and a label difference degree between predicted labels output by the data classification model and the data labels, and the trained data classification model can extract, from data of the target modality, data features that discriminate between labels.
2. The method for training a multi-modal classification model based on label constraints according to claim 1, wherein inputting the data labels into the trained label classification model to obtain the label features corresponding to the data labels comprises: generating, according to a data label, a label text that includes the data label; and inputting the label text into the trained label classification model to obtain label features corresponding to the label text, wherein the label classification model is obtained by training on a training data set comprising a plurality of training label texts and corresponding training data labels.
3. The method for training a multi-modal classification model based on label constraints according to claim 1, wherein the target loss function value is a weighted sum of the feature difference degree and the label difference degree, and the weight of the feature difference degree or of the label difference degree is used to reduce the gap in magnitude between the values of the feature difference degree and the label difference degree.
4. The method for training a multi-modal classification model based on label constraints according to claim 1, wherein the label difference degree is a cross-entropy loss function and/or the feature difference degree is a KL divergence.
5. The method for training a multi-modal classification model based on label constraints according to claim 1, wherein the label classification model is a classification model based on a Transformer network.
6. A multi-modal classification model training apparatus based on label constraints, the apparatus comprising: a data determining module, used for determining training data of a target modality for training a model and corresponding data labels, wherein the target modality comprises at least one of an audio modality, an image modality and a text modality, the training data comprises at least one of audio data, image data and text data, and the data labels comprise at least one of data labels corresponding to the audio data, data labels corresponding to the image data and data labels corresponding to the text data; a feature extraction module, used for inputting the training data into a converged data classification model to obtain training data features corresponding to the training data, wherein the data classification model is used for extracting features of data of the target modality, the data classification model comprises at least one of an audio classification model, an image classification model and a text classification model, the audio classification model comprises at least one of a Speech Transformer model and a Conformer model, the image classification model comprises at least one of a CNN model, a ViT model and a CoTNet model, and the text classification model comprises at least one of a BERT model, an XLNet model and a RoBERTa model; a label processing module, used for inputting the data labels into a trained label classification model to obtain label features corresponding to the data labels; and a model training module, used for inputting the training data and the data labels into the data classification model for training, and optimizing model parameters of the data classification model according to a target loss function value in training until convergence to obtain the trained data classification model, wherein the target loss function value comprises a feature difference degree between the training data features and the label features and a label difference degree between predicted labels output by the data classification model and the data labels, and the trained data classification model can extract, from data of the target modality, data features that discriminate between labels.
7. A multi-modal classification model training apparatus based on label constraints, the apparatus comprising: a memory storing executable program code; and a processor coupled to the memory; wherein the processor invokes the executable program code stored in the memory to perform the method for training a multi-modal classification model based on label constraints of any one of claims 1-5.

Description

Multi-mode classification model training method and device based on label constraint

Technical Field

The invention relates to the technical field of algorithm model training, and in particular to a multi-modal classification model training method and device based on label constraint.

Background

With the development of algorithm technology, more and more enterprises use algorithm models for prediction tasks related to data classification, such as predicting the categories or labels associated with data of a specific modality. These tasks require the model to fully extract features from the data and process them. However, when such a model is trained in the prior art, the degree of difference between the labels and the extracted features is not taken into account, so training cannot effectively improve how well the extracted features discriminate between labels or how closely they are associated with the labels, and the training effect is poor. The prior art therefore has shortcomings that need to be addressed.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a multi-modal classification model training method and device based on label constraint, which on the one hand allow the features extracted by the model to discriminate between labels and on the other hand improve the prediction performance of the model.
To solve the above technical problem, the first aspect of the invention discloses a multi-modal classification model training method based on label constraint, which comprises the following steps:
determining training data of a target modality for training a model and corresponding data labels;
inputting the training data into a converged data classification model to obtain training data features corresponding to the training data, wherein the data classification model is used for extracting features of data of the target modality;
inputting the data labels into a trained label classification model to obtain label features corresponding to the data labels;
and inputting the training data and the data labels into the data classification model for training, and optimizing model parameters of the data classification model according to a target loss function value in training until convergence to obtain the trained data classification model, wherein the target loss function value comprises a feature difference degree between the training data features and the label features.
As an optional implementation, in the first aspect of the invention, the target modality comprises at least one of an audio modality, an image modality and a text modality, and/or the data classification model comprises at least one of an audio classification model, an image classification model and a text classification model.
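As a rough illustration of the feature difference degree in the target loss function described above, the sketch below compares a training data feature vector with a label feature vector. KL divergence is named in claim 4 as one option for this term; the softmax normalization, the direction of the divergence, and every function name and value here are illustrative assumptions rather than details taken from the patent.

```python
import math

def softmax(values):
    """Turn a raw feature vector into a probability distribution."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def feature_difference(data_features, label_features):
    """Hypothetical feature difference degree: KL(label || data) over
    softmax-normalized feature vectors (an assumed formulation)."""
    if len(data_features) != len(label_features):
        raise ValueError("feature vectors must have the same length")
    return kl_divergence(softmax(label_features), softmax(data_features))

# Made-up feature vectors, for illustration only.
label_features = [2.0, 0.5, -1.0]   # stands in for the label model's output
close_features = [1.9, 0.6, -0.9]   # data features that resemble the label features
far_features = [-1.0, 0.5, 2.0]     # data features that do not

close_diff = feature_difference(close_features, label_features)
far_diff = feature_difference(far_features, label_features)
```

Under these assumptions, data features that resemble the label features give a smaller difference than features that do not, which is the signal the training step would minimize.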
As an optional implementation, in the first aspect of the invention, inputting the data labels into the trained label classification model to obtain the label features corresponding to the data labels comprises: generating, according to a data label, a label text that includes the data label; and inputting the label text into the trained label classification model to obtain label features corresponding to the label text, wherein the label classification model is obtained by training on a training data set comprising a plurality of training label texts and corresponding training data labels.
As an optional implementation, in the first aspect of the invention, the target loss function value comprises the feature difference degree and a label difference degree between the predicted labels output by the data classification model and the data labels.
As an optional implementation, in the first aspect of the invention, the target loss function value is a weighted sum of the feature difference degree and the label difference degree, and the weight of the feature difference degree or of the label difference degree is used to reduce the gap in magnitude between the values of the feature difference degree and the label difference degree.
As an optional implementation, in the first aspect of the invention, the label difference degree is a cross-entropy loss function, and/or the feature difference degree is a KL divergence.
As an optional implementation, in the first aspect of the invention, the label classification model is a classification model based on a Transformer network.
As an optional implementation, in the first aspect of the invention, the audio classification model includes at least one of a Speech Transformer model and a Conformer model, and/or the image classification model includes at least one of a CNN model, a ViT model and a CoTNet model, and/or the text classification model includes at least one of a BERT model, an XLNet model and a RoBERTa model.
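The weighted-sum form of the target loss function value described above can be sketched as follows. The cross-entropy label term and the weighted sum follow the text; this section does not say how the weights are chosen, so the `balancing_weight` heuristic (scaling the feature term to the magnitude of the label term), together with all names and numbers, is an assumption for illustration only.

```python
import math

def cross_entropy(predicted_probs, true_index, eps=1e-12):
    """Label difference degree: negative log-probability of the true label."""
    return -math.log(predicted_probs[true_index] + eps)

def balancing_weight(label_diff, feature_diff, eps=1e-12):
    """Hypothetical heuristic: a weight that brings the feature term
    to roughly the same magnitude as the label term."""
    return label_diff / (feature_diff + eps)

def target_loss(label_diff, feature_diff, label_weight=1.0, feature_weight=1.0):
    """Weighted sum of the label difference degree and the feature difference degree."""
    return label_weight * label_diff + feature_weight * feature_diff

# Made-up values: the model assigns probability 0.6 to the true class (index 1),
# and the feature difference degree (e.g. a KL divergence) is much smaller.
predicted_probs = [0.3, 0.6, 0.1]
label_diff = cross_entropy(predicted_probs, true_index=1)
feature_diff = 0.004

feature_weight = balancing_weight(label_diff, feature_diff)
loss = target_loss(label_diff, feature_diff, feature_weight=feature_weight)
```

With this weight the two terms contribute about equally to `loss`. In an actual training loop the weight would more plausibly be a fixed hyperparameter chosen once from observed loss magnitudes; the source leaves this open.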
The second aspect of the invention discloses a multi-modal classification model training device based on label constraint, which comprises: The data determining module is used for determining training