CN-116705008-B - Training method and device for intention understanding model for protecting privacy

CN116705008B

Abstract

The embodiments of the present specification disclose a training method for a privacy-protecting intent understanding model, wherein the intent understanding model includes a speech encoder and a first decoder. The method performs multiple rounds of iterative training. In each round, any speech sample in the current batch is processed using the speech encoder and, among a plurality of decoders, the decoder corresponding to the task type of the sample's label, to obtain a speech prediction result; the plurality of decoders comprise the first decoder and a plurality of second decoders for executing a plurality of privacy tasks. A first training then updates the speech encoder and the plurality of decoders with the goal of reducing the gap between each speech prediction result and the corresponding speech sample label. Finally, a second training is performed on the first-trained intent understanding model with the goal of increasing the gap between the speech prediction result and the speech sample label for each privacy task, while reducing the gap between the speech prediction result output by the first decoder and the speech sample label.
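The two training objectives summarized above can be written compactly as a pair of losses. The notation below (losses $\mathcal{L}$, weights $\lambda_i$, number of privacy tasks $K$) is our own shorthand for the abstract's description, not symbols taken from the patent:

```latex
% First training: the encoder and all decoders jointly fit their tasks
\mathcal{L}_{\text{first}} \;=\; \mathcal{L}_{1} \;+\; \sum_{i=1}^{K} \mathcal{L}_{2}^{(i)}

% Second (adversarial) training: preserve the main task while
% increasing the privacy-task gaps (second decoders held fixed)
\mathcal{L}_{\text{adv}} \;=\; \mathcal{L}_{1} \;-\; \sum_{i=1}^{K} \lambda_{i}\, \mathcal{L}_{2}^{(i)}
```

Here $\mathcal{L}_{1}$ is the loss of the first decoder (the intent task) and $\mathcal{L}_{2}^{(i)}$ the loss of the $i$-th privacy task; the second form matches claim 5's "positively correlated with a first loss and negatively correlated with the second losses".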

Inventors

  • Huang Wei
  • Wang Yinggui
  • Wang Lei

Assignees

  • Alipay (Hangzhou) Information Technology Co., Ltd. (支付宝(杭州)信息技术有限公司)

Dates

Publication Date
2026-05-05
Application Date
2023-07-11

Claims (15)

  1. A training method for a privacy-protecting intention understanding model, wherein the intention understanding model comprises a speech encoder and a first decoder, the method involves multiple rounds of iterative training, and any round of iterative training comprises: for any speech sample in the current batch of speech samples, processing the sample using the speech encoder and, among a plurality of decoders, the decoder corresponding to the task type to which the sample's label belongs, to obtain a speech prediction result, wherein the plurality of decoders comprise the first decoder and a plurality of second decoders for executing a plurality of privacy tasks; performing a first training on the speech encoder and the plurality of decoders with the goal of reducing the gap between each speech prediction result and the corresponding speech sample label; and performing a second training on the first-trained intention understanding model with the goal of increasing the gap between the speech prediction result and the speech sample label for each privacy task, while reducing the gap between the speech prediction result output by the first decoder and the speech sample label.
  2. The method of claim 1, wherein the plurality of privacy tasks includes a speech recognition task and/or a speaker recognition task.
  3. The method of claim 1, wherein the plurality of privacy tasks includes a speech recognition task, and wherein, prior to performing the multiple rounds of iterative training, the method further comprises: pre-training the speech encoder and the second decoder for performing the speech recognition task using a set of training samples corresponding to the speech recognition task.
  4. The method of claim 1, wherein the plurality of privacy tasks include a speech recognition task, and wherein, for any speech sample in the current batch of speech samples, processing the sample with the speech encoder and the decoder of the plurality of decoders corresponding to the task type of the sample's label to obtain a speech prediction result comprises: for any speech sample, segmenting the corresponding original audio feature vector along the time dimension and shuffling the segments to obtain an out-of-order audio feature vector; processing the out-of-order audio feature vector with the speech encoder to obtain a hidden-layer representation; and inputting the hidden-layer representation into the decoder corresponding to the task type of the sample's label to obtain the speech prediction result.
  5. The method of claim 1, wherein performing the second training on the first-trained intent understanding model, with the goal of increasing the gap between the speech prediction result and the speech sample label for each privacy task and decreasing the gap between the speech prediction result output by the first decoder and the speech sample label, comprises: determining an adversarial training loss that is positively correlated with a first loss and negatively correlated with a plurality of second losses corresponding to the plurality of privacy tasks, the first loss being determined based on the speech prediction result output by the first decoder and the speech sample label, and each second loss being determined based on the speech prediction result and the speech sample label corresponding to its privacy task.
  6. The method of claim 1, wherein the model parameters of the plurality of second decoders are fixed during the second training of the first-trained intent understanding model.
  7. A privacy-protecting intent understanding method, comprising: acquiring a speech sample to be processed; and inputting the speech sample into an intention understanding model trained by the method of claim 1 to obtain a corresponding intent prediction result.
  8. A training method for a privacy-protecting business prediction model, wherein the business prediction model comprises an object encoder for a business object and a first decoder for executing a main task, the method involves multiple rounds of iterative training, and any round of iterative training comprises: for any object sample in the current batch of object samples, processing the sample using the object encoder and, among a plurality of decoders, the decoder corresponding to the task type to which the sample's label belongs, to obtain an object prediction result, wherein the plurality of decoders comprise the first decoder and a plurality of second decoders for executing a plurality of privacy tasks; performing a first training on the object encoder and the plurality of decoders with the goal of reducing the gap between each object prediction result and the corresponding object sample label; and performing a second training on the first-trained business prediction model with the goal of increasing the gap between the object prediction results and the object sample labels for each privacy task, while reducing the gap between the object prediction results output by the first decoder and the object sample labels.
  9. The method of claim 8, wherein the business object is speech, an image, or text.
  10. The method of claim 8, wherein the model parameters of the plurality of second decoders are fixed during the second training of the first-trained business prediction model.
  11. A privacy-protecting business prediction method, comprising: obtaining an object sample to be processed; and inputting the object sample into a business prediction model trained by the method of claim 8 to obtain a corresponding business prediction result.
  12. A training device for a privacy-protecting intention understanding model, wherein the intention understanding model comprises a speech encoder and a first decoder, and the device realizes any round of multiple rounds of iterative training by means of the following units: a speech prediction unit configured to, for any speech sample in the current batch of speech samples, process the sample using the speech encoder and, among a plurality of decoders, the decoder corresponding to the task type to which the sample's label belongs, to obtain a speech prediction result, wherein the plurality of decoders comprise the first decoder and a plurality of second decoders for executing a plurality of privacy tasks; a first training unit configured to perform a first training on the speech encoder and the plurality of decoders with the goal of reducing the gap between each speech prediction result and the corresponding speech sample label; and a second training unit configured to perform a second training on the first-trained intention understanding model with the goal of increasing the gap between the speech prediction result and the speech sample label for each privacy task, while reducing the gap between the speech prediction result output by the first decoder and the speech sample label.
  13. A training device for a privacy-protecting business prediction model, wherein the business prediction model comprises an object encoder for a business object and a first decoder for executing a main task, and the device realizes any round of multiple rounds of iterative training by means of the following units: an object prediction unit configured to, for any object sample in the current batch of object samples, process the sample using the object encoder and, among a plurality of decoders, the decoder corresponding to the task type to which the sample's label belongs, to obtain an object prediction result, wherein the plurality of decoders comprise the first decoder and a plurality of second decoders for executing a plurality of privacy tasks; a first training unit configured to perform a first training on the object encoder and the plurality of decoders with the goal of reducing the gap between each object prediction result and the corresponding object sample label; and a second training unit configured to perform a second training on the first-trained business prediction model with the goal of increasing the gap between the object prediction results and the object sample labels for each privacy task, while reducing the gap between the object prediction results output by the first decoder and the object sample labels.
  14. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed in a computer, causes the computer to perform the method of any one of claims 1-11.
  15. A computing device comprising a memory and a processor, wherein the memory stores executable code which, when executed by the processor, implements the method of any one of claims 1-11.
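Claim 4's time-dimension segmentation and shuffling can be illustrated with a minimal sketch. The fixed segment length, the plain-Python list representation of feature frames, and the function name are illustrative assumptions, not details from the patent:

```python
import random

def shuffle_time_segments(features, segment_len, rng=None):
    """Segment a frame sequence along the time axis and shuffle the
    segment order, yielding an 'out-of-order' audio feature sequence.

    features: list of per-frame feature vectors (time-major).
    segment_len: frames per segment (an assumed hyperparameter).
    """
    rng = rng or random.Random()
    # Split the time axis into contiguous segments.
    segments = [features[i:i + segment_len]
                for i in range(0, len(features), segment_len)]
    # Scramble the order of segments; frames inside a segment keep
    # their relative order, so local acoustics survive while the
    # global word/speaker sequence is disrupted.
    rng.shuffle(segments)
    # Re-concatenate into a single out-of-order sequence.
    return [frame for seg in segments for frame in seg]

# Usage: 6 frames, segments of 2 -> same frames, permuted segment order
frames = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]
scrambled = shuffle_time_segments(frames, 2, random.Random(0))
```

The design intent, per the description, is that the scrambled input makes it harder for a downstream attacker to reconstruct the original utterance while the encoder can still extract intent-relevant features.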

Description

Training method and device for intention understanding model for protecting privacy

Technical Field

One or more embodiments of the present disclosure relate to the field of machine learning technologies, and in particular to a training method and apparatus for a privacy-protecting intention understanding model, a training method and apparatus for a privacy-protecting business prediction model, and a privacy-protecting business prediction method and apparatus.

Background

Spoken Language Understanding (SLU) is a technology for analyzing the user intent expressed in a speaker's voice, and is widely used in scenarios such as in-vehicle voice control and smart homes. An SLU system is typically implemented by constructing an SLU model using machine learning techniques. However, in the current process of constructing and using SLU models, there is a risk that intermediate data generated by the model is stolen, causing user privacy to be leaked. In general, protecting user privacy requires sacrificing data availability to a greater extent. Therefore, a solution is needed that can ensure, or even improve, the accuracy of SLU prediction results while strengthening the protection of user privacy.

Disclosure of Invention

The embodiments of this specification describe a training method and device for a privacy-protecting intention understanding model, which can better meet practical application requirements.
According to a first aspect, a training method of a privacy-protecting intent understanding model is provided, the intent understanding model comprising a speech encoder and a first decoder, and the method involving multiple rounds of iterative training, wherein any round of iterative training comprises: processing any speech sample in the current batch of speech samples using the speech encoder and, among a plurality of decoders, the decoder corresponding to the task type to which the sample's label belongs, to obtain a speech prediction result, wherein the plurality of decoders comprise the first decoder and a plurality of second decoders for executing a plurality of privacy tasks; performing a first training on the speech encoder and the plurality of decoders with the objective of reducing the gap between each speech prediction result and the corresponding speech sample label; and performing a second training on the first-trained intent understanding model with the objective of increasing the gap between the speech prediction result and the speech sample label for each privacy task, while reducing the gap between the speech prediction result output by the first decoder and the speech sample label.

In one embodiment, the plurality of privacy tasks includes a speech recognition task and/or a speaker recognition task.

In one embodiment, the plurality of privacy tasks includes a speech recognition task, and prior to performing the multiple rounds of iterative training, the method further comprises pre-training the speech encoder and the second decoder for performing the speech recognition task with a set of training samples corresponding to the speech recognition task.
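One round of the two-stage procedure described above can be sketched as follows. The loss combination mirrors claims 5 and 6 (adversarial loss positively correlated with the main-task loss, negatively with the privacy-task losses; second decoders frozen in stage two), while the function names and the per-task weights are illustrative assumptions:

```python
def first_training_loss(main_loss, privacy_losses):
    """Stage 1: encoder and all decoders jointly fit their own tasks,
    so every loss is minimized together."""
    return main_loss + sum(privacy_losses)

def second_training_loss(main_loss, privacy_losses, weights=None):
    """Stage 2 (adversarial): keep the main task accurate while pushing
    the privacy-task losses up. The weights are an assumed knob for
    balancing the privacy tasks; the patent only states the sign of
    the correlation."""
    weights = weights or [1.0] * len(privacy_losses)
    return main_loss - sum(w * l for w, l in zip(weights, privacy_losses))

def trainable_modules(stage, encoder, first_decoder, second_decoders):
    """Per claim 6, the second decoders' parameters stay fixed during
    the second training; only the encoder (and first decoder) move."""
    if stage == 1:
        return [encoder, first_decoder] + list(second_decoders)
    return [encoder, first_decoder]
```

Because the second decoders are frozen in stage two, increasing their losses forces the encoder itself to strip privacy-relevant information from its hidden representation, rather than letting the decoders simply become worse.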
In one embodiment, the plurality of privacy tasks include a speech recognition task, wherein, for any speech sample in the current batch of speech samples, processing the sample using the speech encoder and the decoder of the plurality of decoders corresponding to the task type of the sample's label to obtain a speech prediction result comprises: segmenting the original audio feature vector corresponding to the speech sample along the time dimension and shuffling the segments to obtain an out-of-order audio feature vector; processing the out-of-order audio feature vector with the speech encoder to obtain a hidden-layer representation; and inputting the hidden-layer representation into the decoder corresponding to the task type of the sample's label to obtain the speech prediction result.

In one embodiment, performing the second training on the first-trained intention understanding model, with the goal of increasing the gap between the speech prediction result and the speech sample label for each privacy task and reducing the gap between the speech prediction result output by the first decoder and the speech sample label, comprises: determining an adversarial training loss that is positively correlated with a first loss and negatively correlated with a plurality of second losses corresponding to the plurality of privacy tasks