CN-122024728-A - Fine tuning method, device and storage medium of wake-up word

CN122024728A

Abstract

When a sample voice signal recorded by a user as a wake-up word is received, the sample voice signal is input into a voice feature network to extract a first sample wake-up voice feature. If the sample voice signal is not in a dialect, the user is monitored for a repeated wake-up event, and the first reference wake-up voice feature extracted by the voice feature network from the first reference voice signal at the time of the failed wake-up operation is queried. If the first reference voice signal is in a dialect, the accent feature of the user speaking the dialect is separated from the first reference wake-up voice feature according to the first sample wake-up voice feature, and the accent feature is fused into the first sample wake-up voice feature by the voice feature network to obtain a second sample wake-up voice feature. When a target voice signal is received, a wake-up operation is executed according to the target voice signal, the first sample wake-up voice feature and the second sample wake-up voice feature. The embodiment enables the wake-up word to adapt to changes in the user's accent, thereby improving the success rate of waking up the device with the wake-up word.

Inventors

  • LI RONGFENG
  • RUAN QUNZHI
  • QIU YINGWEI
  • TANG JINYUN
  • YANG BIWAN
  • TANG SHIYU

Assignees

  • 广州易而达科技股份有限公司

Dates

Publication Date
2026-05-12
Application Date
2026-04-10

Claims (10)

  1. A method for fine-tuning a wake-up word, comprising: when a sample voice signal recorded by a user as a wake-up word is received, inputting the sample voice signal into a voice feature network to extract a first sample wake-up voice feature; if the sample voice signal is not in a dialect, monitoring the user for a repeated wake-up event, wherein the repeated wake-up event is a successful wake-up operation following a failed wake-up operation; querying a first reference wake-up voice feature extracted by the voice feature network from a first reference voice signal at the time of the failed wake-up operation; if the first reference voice signal is in a dialect, separating an accent feature of the user speaking the dialect from the first reference wake-up voice feature according to the first sample wake-up voice feature; fusing the accent feature into the first sample wake-up voice feature using the voice feature network to obtain a second sample wake-up voice feature; and when a target voice signal is received, executing a wake-up operation according to the target voice signal, the first sample wake-up voice feature and the second sample wake-up voice feature.
  2. The method of claim 1, wherein monitoring the user for a repeated wake-up event comprises: when a first reference voice signal is received, inputting the first reference voice signal into the voice feature network to extract a first reference wake-up voice feature; if the similarity between the first reference wake-up voice feature and the first sample wake-up voice feature is greater than or equal to a first confidence threshold and smaller than a second confidence threshold, determining that the wake-up operation has failed and starting a timer; when a second reference voice signal is received before the timer expires, inputting the second reference voice signal into the voice feature network to extract a second reference wake-up voice feature; and if the similarity between the second reference wake-up voice feature and the first sample wake-up voice feature is greater than or equal to the second confidence threshold, executing a wake-up operation and determining that a repeated wake-up event has occurred.
  3. The method of claim 1, wherein separating the accent feature of the user speaking the dialect from the first reference wake-up voice feature according to the first sample wake-up voice feature comprises: calculating an overall residual between the first sample wake-up voice feature and the first reference wake-up voice feature in each dimension; calculating a shrinkage covariance matrix of the overall residual; performing randomized singular value decomposition on the shrinkage covariance matrix to obtain an accent vector space; projecting the first reference wake-up voice feature onto the orthogonal complement of the accent vector space to obtain an accent-removed feature; subtracting the mean of the accent-removed feature from the first reference wake-up voice feature in each dimension to update the overall residual; calculating an accent variation amplitude between the accent vector spaces before and after the update of the overall residual; and if the accent variation amplitude is smaller than a preset vector threshold, determining the accent vector space after the update of the overall residual to be the accent feature of the user speaking the dialect.
  4. The method of claim 3, wherein separating the accent feature of the user speaking the dialect from the first reference wake-up voice feature according to the first sample wake-up voice feature further comprises: determining a sample intermediate voice feature, wherein the sample intermediate voice feature is the output of an intermediate layer within a designated range of the voice feature network when the first sample wake-up voice feature is extracted; determining a reference intermediate voice feature, wherein the reference intermediate voice feature is the output of an intermediate layer within the designated range of the voice feature network when the first reference wake-up voice feature is extracted; for the same intermediate layer, calculating the difference between the sample intermediate voice feature and the reference intermediate voice feature in each dimension as an intermediate residual; calculating the ratio of the variance of the intermediate residual to an accent stability to obtain a degree of distinction, wherein the accent stability is the mean similarity between the sample intermediate voice features; and replacing the first sample wake-up voice feature with the dimensions of the sample intermediate voice feature having the highest degree of distinction, and replacing the first reference wake-up voice feature with the dimensions of the reference intermediate voice feature having the highest degree of distinction.
  5. The method of claim 3, wherein separating the accent feature of the user speaking the dialect from the first reference wake-up voice feature according to the first sample wake-up voice feature further comprises: calculating an individual residual between a newly added first reference wake-up voice feature and the first sample wake-up voice feature; updating the overall residual according to the individual residual; and updating the shrinkage covariance matrix according to the updated overall residual.
  6. The method of claim 5, wherein updating the overall residual according to the individual residual comprises: determining a first residual variation amplitude, wherein the first residual variation amplitude is the difference between the individual residual and the mean of the overall residual; determining a new-sample coefficient, which is the reciprocal of the total number of residuals plus 1; and adding the product of the new-sample coefficient and the first residual variation amplitude to the mean of the overall residual to update the overall residual; and wherein updating the shrinkage covariance matrix according to the updated overall residual comprises: determining an attenuation coefficient, wherein the attenuation coefficient is the ratio of the total number of residuals minus 1 to the total number of residuals; determining a second residual variation amplitude, wherein the second residual variation amplitude is the difference between the individual residual and the updated overall residual; and adding the product of the new-sample coefficient, the first residual variation amplitude and the transpose of the second residual variation amplitude to the product of the attenuation coefficient and the shrinkage covariance matrix to update the shrinkage covariance matrix.
  7. The method of any one of claims 1-6, wherein the voice feature network comprises an encoder, a decoder and a linear layer, and wherein fusing the accent feature into the first sample wake-up voice feature using the voice feature network to obtain the second sample wake-up voice feature comprises: determining a sample encoded voice feature, wherein the sample encoded voice feature is the output of the encoder when the first sample wake-up voice feature is extracted; concatenating the accent feature and the sample encoded voice feature into an original multi-voice feature; inputting the original multi-voice feature into a preset bidirectional long short-term memory network to extract a candidate multi-voice feature; inputting the candidate multi-voice feature into the decoder to decode it into a target multi-voice feature; and inputting the target multi-voice feature into the linear layer to map it to the second sample wake-up voice feature.
  8. The method of claim 2, wherein executing the wake-up operation according to the target voice signal, the first sample wake-up voice feature and the second sample wake-up voice feature comprises: inputting the target voice signal into the voice feature network to extract a target wake-up voice feature; and if the similarity between the target wake-up voice feature and the first sample wake-up voice feature is greater than or equal to the first confidence threshold and smaller than the second confidence threshold, and the similarity between the target wake-up voice feature and the second sample wake-up voice feature is greater than or equal to the second confidence threshold, executing the wake-up operation.
  9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the method for fine-tuning a wake-up word according to any one of claims 1-8 when executing the computer program.
  10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method for fine-tuning a wake-up word according to any one of claims 1-8.
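As a rough illustration of the accent-separation steps in claim 3, the following NumPy sketch computes per-dimension residuals between reference and sample features, estimates a shrinkage covariance, extracts an accent subspace from it, and projects reference features onto the orthogonal complement. The patent specifies randomized SVD; exact SVD is substituted here for simplicity, and all names, shapes, and parameters (shrinkage weight `alpha`, subspace rank) are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def shrinkage_covariance(residuals, alpha=0.1):
    """Shrink the sample covariance toward a scaled identity (alpha is an assumed weight)."""
    cov = np.cov(residuals, rowvar=False)
    target = np.eye(cov.shape[0]) * np.trace(cov) / cov.shape[0]
    return (1 - alpha) * cov + alpha * target

def accent_subspace(cov, rank=4):
    """Top singular vectors of the residual covariance: the 'accent vector space'.
    The claim uses randomized SVD; exact SVD stands in for it here."""
    u, _, _ = np.linalg.svd(cov)
    return u[:, :rank]  # orthonormal columns spanning the accent subspace

def remove_accent(ref_feats, basis):
    """Project features onto the orthogonal complement of the accent subspace."""
    proj = basis @ basis.T
    return ref_feats - ref_feats @ proj

rng = np.random.default_rng(0)
sample = rng.normal(size=(1, 16))                      # first sample wake-up feature (toy)
refs = sample + rng.normal(scale=0.3, size=(20, 16))   # failed-wake reference features (toy)

residuals = refs - sample            # overall residual, per dimension
cov = shrinkage_covariance(residuals)
basis = accent_subspace(cov)
clean = remove_accent(refs, basis)   # accent-removed reference features
print(clean.shape)                   # → (20, 16)
```

In a full implementation of the claim, the residual mean would then be recomputed from `clean`, the subspace re-estimated, and the loop stopped once the change between successive accent subspaces falls below the preset vector threshold.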
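Claim 6 recites an online, rank-one update of the residual mean and shrinkage covariance as each new reference feature arrives. One plausible NumPy reading of that arithmetic, with all variable names assumed, is:

```python
import numpy as np

def update_mean_cov(mean, cov, x, n):
    """Fold a new individual residual x into the running overall-residual mean
    and shrinkage covariance, following the arithmetic recited in claim 6.
    n is the number of residuals already accumulated (assumed >= 1)."""
    k = 1.0 / (n + 1)            # new-sample coefficient: 1 / (count + 1)
    d1 = x - mean                # first residual variation amplitude
    new_mean = mean + k * d1     # updated overall-residual mean
    decay = (n - 1) / n          # attenuation coefficient: (count - 1) / count
    d2 = x - new_mean            # second residual variation amplitude
    new_cov = decay * cov + k * np.outer(d1, d2)
    return new_mean, new_cov

mean = np.zeros(3)
cov = np.eye(3)
x = np.array([0.3, -0.1, 0.2])
mean, cov = update_mean_cov(mean, cov, x, n=5)
print(mean)
```

This has the same structure as a Welford-style incremental covariance update, which is presumably the intent: old statistics are decayed rather than recomputed from scratch each time a failed-then-successful wake-up contributes a new residual.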

Description

Fine tuning method, device and storage medium of wake-up word

Technical Field

The invention belongs to the technical field of voice processing, and in particular relates to a fine-tuning method, device and storage medium for wake-up words.

Background

Intelligent electronic devices such as speakers, televisions and lamps generally rely on a user's personalized wake-up word to provide personalized services: the electronic device switches from a standby state to a working state after detecting the wake-up word uttered by the user, and loads the parameters configured for that user. At present, the wake-up word is usually recorded when the electronic device is first activated or reset, and the user typically speaks it carefully in a standardized language such as Mandarin or English. In everyday use, however, many users speak in a relaxed manner, and the spoken wake-up word may carry an accent, which reduces the success rate of waking up the device with the wake-up word.

Disclosure of Invention

In view of the above, the present invention provides a fine-tuning method, device and storage medium for wake-up words, which are used to increase the success rate of waking up a device with a wake-up word.
The first aspect of the present invention provides a method for fine-tuning a wake-up word, comprising: when a sample voice signal recorded by a user as a wake-up word is received, inputting the sample voice signal into a voice feature network to extract a first sample wake-up voice feature; if the sample voice signal is not in a dialect, monitoring the user for a repeated wake-up event, wherein the repeated wake-up event is a successful wake-up operation following a failed wake-up operation; querying a first reference wake-up voice feature extracted by the voice feature network from a first reference voice signal at the time of the failed wake-up operation; if the first reference voice signal is in a dialect, separating an accent feature of the user speaking the dialect from the first reference wake-up voice feature according to the first sample wake-up voice feature; fusing the accent feature into the first sample wake-up voice feature using the voice feature network to obtain a second sample wake-up voice feature; and when a target voice signal is received, executing a wake-up operation according to the target voice signal, the first sample wake-up voice feature and the second sample wake-up voice feature.
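The two-threshold wake decision this method relies on (claims 2 and 8) can be sketched as follows. Cosine similarity, the function names, and the concrete threshold values are assumptions for illustration; the patent only requires some similarity measure and two confidence thresholds.

```python
import numpy as np

LOW_CONF = 0.6    # first confidence threshold (assumed value)
HIGH_CONF = 0.85  # second confidence threshold (assumed value)

def cosine(a, b):
    """Cosine similarity, standing in for the patent's unspecified similarity measure."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def wake_decision(target_feat, sample_feat, accented_feat):
    """Per claim 8: wake when the target is 'almost' the recorded sample
    (similarity between the two thresholds) but clearly matches the
    accent-fused second sample feature."""
    s1 = cosine(target_feat, sample_feat)
    s2 = cosine(target_feat, accented_feat)
    if s1 >= HIGH_CONF:
        return True   # plain match against the originally recorded wake-up word
    if LOW_CONF <= s1 < HIGH_CONF and s2 >= HIGH_CONF:
        return True   # accented utterance rescued by the fused feature
    return False
```

The middle band between the two thresholds is exactly the region where claim 2 starts its timer to watch for a repeated wake-up attempt, and where claim 8 lets the accent-fused feature decide.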
A second aspect of the present invention provides a fine-tuning device for wake-up words, comprising: a first sample feature extraction module, configured to, when a sample voice signal recorded by a user as a wake-up word is received, input the sample voice signal into a voice feature network to extract a first sample wake-up voice feature; a repeated wake-up event monitoring module, configured to monitor the user for a repeated wake-up event if the sample voice signal is not in a dialect, wherein the repeated wake-up event is a successful wake-up operation following a failed wake-up operation; a reference feature query module, configured to query a first reference wake-up voice feature extracted by the voice feature network from a first reference voice signal at the time of the failed wake-up operation; an accent feature separation module, configured to separate an accent feature of the user speaking the dialect from the first reference wake-up voice feature according to the first sample wake-up voice feature if the first reference voice signal is in a dialect; a second sample feature generation module, configured to fuse the accent feature into the first sample wake-up voice feature using the voice feature network to obtain a second sample wake-up voice feature; and a wake-up operation execution module, configured to, when a target voice signal is received, execute a wake-up operation according to the target voice signal, the first sample wake-up voice feature and the second sample wake-up voice feature.
A third aspect of the present invention provides an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the method for fine-tuning a wake-up word according to the first aspect when executing the computer program. A fourth aspect of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for fine-tuning a wake-up word according to the first aspect. A fifth aspect of the invention provides a computer program product which, when run on a computer, causes