CN-114283796-B - Automatic voice recording method for online customization and updating of hotword in telephone scene

Abstract

The invention discloses an automatic voice recording method for online customization and updating of hotwords in a telephone scene. The method first trains an automatic speech recognition model based on a deep neural network, then retrains that model on telephone audio files to generate a pre-trained model; it then customizes a self-updating model of the differentiated language model together with a hotword list, and adjusts the weights of the decoding stage in real time to form an adaptive speech recognition hotword system. The invention provides a speech recognition system that customizes hotwords online for telephone customer service in different scenes. It generates a hotword list for each scene, which improves the accuracy of converting speech into text; it provides a real-time hotword weight updating algorithm that counts hotword frequencies online and updates model parameters in the decoding stage of the acoustic model according to those frequencies; and it re-learns 8 kHz telephone audio by transfer learning, which improves the accuracy of speech recognition in telephone scenes.

Inventors

  • CAO JIUWEN
  • QIAN YIYANG
  • WANG TIANLEI
  • LIU PENG
  • Xiang Jianfa

Assignees

  • Hangzhou Dianzi University (杭州电子科技大学)

Dates

Publication Date
20260421
Application Date
20211123
Priority Date
20211123

Claims (2)

  1. An automatic voice recording method for online customization and updating of hotwords in a telephone scene, characterized by comprising the following steps:

     S1, training an automatic speech recognition model based on a deep neural network on a public general speech data set, wherein the training comprises determining the basic parameters of the model, initializing the weights of all layers of the model, and determining an optimization method;

     S2, retraining the automatic speech recognition model on 8 kHz telephone audio files acquired in the telephone scene to generate a pre-trained model of the automatic speech recognition model for the 8 kHz telephone scene; wherein the 8 kHz audio from the telephone scene is converted into spectrograms and the spectrograms are used for training by transfer learning: the automatic speech recognition model trained on 16 kHz public standard audio data, namely the teacher model, is retrained to obtain a student model, the model is fine-tuned, and the teacher model guides the training of the student model;

     S3, deploying and running the pre-trained model, wherein during operation the pre-trained model recognizes conversation speech and the resulting text data is stored for online customization of hotwords;

     S4, customizing a self-updating model of the differentiated language model and a hotword list, and adjusting the weights of the decoding stage in real time to form an adaptive speech recognition hotword system;

     S4-1, performing word segmentation on the text data, removing stop words, and extracting word frequencies;

     S4-2, normalizing the word frequencies according to probability difference, using the following formula:

     $$Wf^{*} = \frac{Wf - \mu}{\sigma}$$

     where $Wf$ represents the word frequency, $Wf^{*}$ the normalized word frequency, $\mu$ the mean of the word frequencies, and $\sigma$ the standard deviation of the word frequencies;

     S4-3, extracting a general hotword list and differential hotword lists for the different scenes, wherein the corresponding activation threshold and differential threshold are determined from the median of the word frequencies; if the frequency of use of a hotword in the current scene exceeds the activation threshold and is smaller than the differential threshold in the different scenes, the hotword is taken as a differential hotword, and the weights of the decoding stage of the model are adjusted only in the scene that exceeds the differential threshold; if the frequency of use of a hotword in the current scene exceeds the activation threshold and also exceeds the differential threshold in the different scenes, the hotword is taken as a general hotword, and its weight is increased in the decoding stage of the model in all scenes; the decision formula is:

     $$word = \begin{cases} hotword_{common}, & Wf > AcTh \ \text{and} \ Wf > DiffTh \\ hotword_{particular}, & Wf > AcTh \ \text{and} \ Wf < DiffTh \end{cases}$$

     where $word$ is the word to be classified, $hotword_{common}$ is a general hotword, $hotword_{particular}$ is a differential hotword, $Wf$ represents the word frequency, $AcTh$ represents the activation threshold, and $DiffTh$ represents the differential threshold;

     S4-4, when the system decodes audio to generate text data, adjusting and optimizing the decoding results according to a hotword weight table: when the spelling of the characters output by the decoder of the Transformer model is the same as a hotword in the hotword list, a hotword decision mode is entered, a new path containing the hotword is added, an offset score is added to the decoding score of that path, and the path is scored again, wherein the formula is as follows:

     $$\delta_k = \log(1 + Wf)$$

     where $score(y_1, \ldots, y_s)$ is the final score of the path from the first word $y_1$ to the last word $y_s$ under Beam Search decoding, $P_{lm}(y_k \mid y_1, \ldots, y_{k-1}, x)$ represents the probability that the $k$-th word appears in the output, and $\delta_k$ represents the offset score; the new path is compared with the other paths, the path with the highest score is taken as the optimal path, and the text result is stored;

     S4-5, setting the update frequency of the adaptive speech recognition hotword system, wherein the hotword system is updated automatically either by adjusting the interval between weight updates or by scheduling weight updates for when the system is idle;

     S4-6, providing an interactive hotword updating mode, in which json files are generated in the format hotword, part of speech, weight grade and uploaded to the system, the offset scores are adjusted according to the weight grade, and users define their own hotwords to meet their requirements; the user-defined hotwords have initial weights, and their weights, activation thresholds and differential thresholds can be adjusted dynamically while the system is in use, which improves the efficiency of speech recognition and adapts the recognition better to the user.
  2. The automatic voice recording method for online customization and updating of hotwords in a telephone scene according to claim 1, wherein the automatic speech recognition model is a Transformer model, which is an Attention-based Seq2Seq model consisting of an Encoder and a Decoder and is trained with multi-head Attention modules; the specific training procedure of the Transformer model is as follows:

     S1-1, first preprocessing the audio data in the public general speech data set, converting the audio data into Fbank audio features, expanding the width of the data through Embedding, and converting the data into a three-dimensional tensor;

     S1-2, adding a position encoder before the tensor is fed into the Encoder layers, because the Transformer encoder itself does not process vocabulary position information; the position encoder adds to the embedded three-dimensional tensor the information that different vocabulary positions can produce different semantics, to make up for the missing position information:

     $$PE_{(pos,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

     where $PE_{(pos,2i)}$ and $PE_{(pos,2i+1)}$ are the $2i$-th and $(2i+1)$-th components of the encoded vector at position $pos$, respectively, and $d_{model}$ represents the dimension of the word vector;

     S1-3, feeding the preprocessed three-dimensional tensor, after the position encoder, into the Encoder layers, of which the Attention layer is the main part; the tensor W is trained with Attention through three trainable matrix parameters Q, K and V; the Transformer model divides the tensor into h head vectors to form several subspaces, projects Q, K and V through h linear transformations, and finally concatenates the different Attention results, according to the formulas:

     $$Attention(Q, K, V) = softmax\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

     $$Head_t = Attention(QW_t^{Q}, KW_t^{K}, VW_t^{V})$$

     $$MultiHead(Q, K, V) = Concat(Head_1, \ldots, Head_h)$$

     where $d_k$ represents the dimension of the model, $Head_t$ represents the output of MultiHead attention for the $t$-th head vector, and $Concat$ represents the concatenation operation of the model;

     S1-4, transforming the feature vectors extracted by the Attention through a Feed Forward layer to increase the expressive capacity of the model, wherein the Feed Forward layer has two layers, the first with a ReLU activation function and the second with a linear activation function, according to the formula:

     $$FFN = \max(0, xW_1 + b_1)W_2 + b_2$$

     where $x$ represents the output of the previous layer, and $W_1$, $W_2$, $b_1$, $b_2$ are trainable parameters; after training through the Encoder layers, the resulting three-dimensional tensor is taken as the K and V inputs of the Decoder layer;

     S1-5, performing Embedding on the text files corresponding to the audio data in the public general speech data set, expanding the data into a three-dimensional tensor, adding positional encoding, and feeding the tensor into the Decoder layer, which passes it through an Attention layer and a Feed Forward layer for training; Q of the Decoder layer comes from the input of the Decoder layer, while K and V come from the output of the Encoder layers, whereas Q, K and V of the Encoder all come from the input of the Encoder layers;

     S1-6, after the Decoder layer, passing the three-dimensional tensor output by the Decoder layer through a linear layer and a Softmax layer to obtain text data, comparing it with the text files corresponding to the audio data in the general speech data set, and calculating a loss function that measures the difference between the forward result of each iteration of the model and the true value, which guides the next training step in the correct direction until the model converges.
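The positional encoding, Attention and Feed Forward steps S1-2 to S1-4 of claim 2 can be illustrated with a small plain-Python sketch. It assumes the standard Transformer definitions (sinusoidal encoding, scaled dot-product attention); the function names and the list-of-lists representation are illustrative choices and do not come from the patent, and the multi-head projection and concatenation are left out for brevity.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position: sin on even components, cos on odd ones."""
    pe = []
    for j in range(d_model):
        i = j // 2  # components 2i and 2i+1 share the same wavelength
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
    return pe

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V, on lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

def feed_forward(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single row vector x."""
    hidden = [max(0.0, sum(xi * W1[r][c] for r, xi in enumerate(x)) + b1[c])
              for c in range(len(b1))]
    return [sum(h * W2[r][c] for r, h in enumerate(hidden)) + b2[c]
            for c in range(len(b2))]
```

For example, `positional_encoding(0, 4)` yields alternating sine and cosine values at angle zero, and `attention` with two identical keys averages the two value rows. A production model would use a tensor library and batched matrices instead of nested lists.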

Description

Automatic voice recording method for online customization and updating of hotword in telephone scene

Technical Field

The invention belongs to the technical field of signal processing, and particularly relates to an automatic voice recording method for online customization and updating of hotwords in a telephone scene.

Background Art

Most telephone-based industries, such as telephone customer service and telephone consultation, communicate by voice. In telephone service industries in particular, calls need to be recorded and backed up so that they can be reviewed later. Backing up the audio recording is simple and feasible, but it occupies more storage space than a text backup, and a reviewer cannot quickly locate a passage when looking through it. Thanks to the advent of deep learning, automatic speech recognition systems have developed rapidly, with fast progress in both the accuracy and the speed of converting speech signals into text. Transcribing call recordings into text with an automatic speech recognition system therefore effectively reduces storage space and makes review easier. However, existing automatic speech recognition models are trained on general data sets and usually lack customization for different fields and scenes. How to customize a speech recognition model for different scenes and improve the accuracy of speech recognition is a problem that urgently needs to be solved.
Meanwhile, current hotword customization techniques in speech recognition only customize hotwords or domain-specific nouns, and the hotwords must be added, uploaded and updated manually. Manually collecting hotword lists for every scene in every industry is impractical. In particular, some hotwords and personalized words are time-sensitive and specific, so the hotword list keeps changing even when the field and scene stay the same; an old hotword list cannot adapt to the current situation, which makes speech recognition inaccurate in some specific scenes. In addition, telephone audio files are mostly sampled at 8 kHz, whereas mainstream speech recognition models use 16 kHz audio files, so accuracy drops when a 16 kHz model is applied directly to telephone audio.

Summary of the Invention

The technical problem to be solved by the invention is to make a speech recognition system customizable for telephone scenes in different industries, so that differentiated speech recognition systems can be built and the accuracy of speech recognition improved. To address the problems above, the invention provides an automatic voice recording method for online customization and updating of hotwords in a telephone scene, which customizes hotwords in real time for different scenes and updates the weights of the speech recognition decoding stage in real time through a hotword frequency algorithm, thereby overcoming the shortcomings described above.
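As a rough illustration of what such a hotword frequency algorithm could look like, the following Python sketch counts word frequencies, normalizes them, classifies a word as a general or differential hotword, and adds an offset score to a decoding path. Several points are assumptions and not taken from the patent text: the normalization is assumed to be a z-score, "different scenes" is read as "all other scenes", the path score is assumed to be a sum of per-token log-probabilities plus offsets, the offset uses the raw word frequency, and all function names and the stop-word list are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, median, pstdev

STOP_WORDS = {"the", "a", "is", "to", "my"}  # hypothetical stop-word list

def word_frequencies(tokens):
    """Drop stop words and count the remaining tokens."""
    return Counter(t for t in tokens if t not in STOP_WORDS)

def normalize(freqs):
    """Assumed z-score normalization: (Wf - mean) / standard deviation."""
    mu = mean(freqs.values())
    sigma = pstdev(freqs.values())
    if sigma == 0:
        return {w: 0.0 for w in freqs}
    return {w: (f - mu) / sigma for w, f in freqs.items()}

def classify(word, current, others, ac_th, diff_th):
    """Label a word as 'common', 'particular', or None (not a hotword).

    current: word -> frequency in the current scene
    others:  list of word -> frequency dicts, one per other scene
    """
    if current.get(word, 0) <= ac_th:
        return None
    # Assumption: a general hotword exceeds diff_th in every other scene.
    if all(o.get(word, 0) > diff_th for o in others):
        return "common"
    return "particular"

def offset_score(wf):
    """Offset score delta = log(1 + Wf)."""
    return math.log1p(wf)

def rescore(path, log_probs, hotword_wf):
    """Assumed path score: sum of log-probabilities plus offsets for hotword tokens."""
    return sum(lp + (offset_score(hotword_wf[t]) if t in hotword_wf else 0.0)
               for t, lp in zip(path, log_probs))

# Example usage with a made-up transcript.
tokens = "refund refund refund order order my invoice".split()
freqs = word_frequencies(tokens)
ac_th = median(freqs.values())  # threshold derived from the median word frequency
label = classify("refund", freqs, [{"refund": 0}], ac_th, diff_th=1)
```

In this example `label` marks "refund" as a differential hotword because it is frequent only in the current scene. During beam search, the rescored hotword path would then be compared with the other candidate paths and the highest-scoring one kept.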
An automatic voice recording method for online customization and updating of hotwords in a telephone scene comprises the following steps:

S1, training an automatic speech recognition model based on a deep neural network on a public general speech data set. The training of the automatic speech recognition model comprises determining the basic parameters of the model, initializing the weights of all layers of the model, and determining an optimization method.

S2, retraining the automatic speech recognition model on 8 kHz telephone audio files acquired in the telephone scene to generate a pre-trained model of the automatic speech recognition model for the 8 kHz telephone scene.

S3, deploying and running the pre-trained model. During operation, the pre-trained model recognizes conversation speech, and the resulting text data is stored for online customization of hotwords.

S4, customizing a self-updating model of the differentiated language model and a hotword list, and adjusting the weights of the decoding stage in real time to form an adaptive speech recognition hotword system.

Preferably, the automatic speech recognition model is a Transformer model, which is an Attention-based Seq2Seq model consisting of an Encoder and a Decoder and is trained with multi-head Attention modules. The specific training procedure of the Transformer model is as follows:

S1-1, first preprocessing the audio data in the public general speech data set, converting the audio data into Fbank audio features, expanding the width of the data through Embedding, and converting the data into a three-dimensional tensor.

S1-2, adding a positio