CN-121999766-A - Word sequence determining method and device in semantic recognition, storage medium and electronic device

Abstract

The application discloses a method and device for determining word sequences in semantic recognition, a storage medium, and an electronic device, and relates to the technical field of smart homes. The method comprises: linearly projecting a hidden vector sequence corresponding to a voice signal to obtain multi-granularity posterior probabilities of the voice signal, the multi-granularity posterior probabilities comprising a character posterior probability, a sub-word posterior probability, and a word posterior probability; searching a multi-level mapping table for the word set and the character set associated with the sub-words in the voice signal; calculating target path scores of the sub-words in the voice signal according to the multi-granularity posterior probabilities, the word set, and the character set; inputting the target path scores into a target decoder to obtain a target sub-word sequence; and obtaining the target word sequence corresponding to the target sub-word sequence according to the multi-level mapping table. The application addresses the problem in the related art that end-to-end ASR misrecognizes rare words, polyphonic characters, technical terms, and the like as homophones that sound similar but differ in meaning ("near-sound, wrong-meaning" errors).

Inventors

YU JINGHUAN
ZHU WENBO
YIN DESHUAI
DUAN QUANSHENG

Assignees

青岛海尔科技有限公司
海尔优家智能科技（北京）有限公司
青岛海尔智能家电科技有限公司

Dates

Publication Date: 20260508
Application Date: 20260408

Claims (15)

1. A method for determining word sequences in semantic recognition, comprising: performing linear projection on a hidden vector sequence corresponding to a voice signal to obtain multi-granularity posterior probabilities of the voice signal, wherein the multi-granularity posterior probabilities comprise a character posterior probability, a sub-word posterior probability and a word posterior probability; searching a multi-level mapping table for a word set and a character set associated with the sub-words in the voice signal, and calculating target path scores of the sub-words in the voice signal according to the multi-granularity posterior probabilities, the word set and the character set, wherein each target path score is determined from an original path score of the sub-word in the voice signal and a calibration score of the sub-word in the voice signal; inputting the target path scores into a target decoder to obtain a target sub-word sequence, and obtaining a target word sequence corresponding to the target sub-word sequence according to the multi-level mapping table; wherein performing linear projection on the hidden vector sequence corresponding to the voice signal to obtain the multi-granularity posterior probabilities of the voice signal comprises: inputting the hidden vector sequence into a multi-path linear structure so that the multi-path linear structure performs linear projection on the hidden vector sequence to obtain multi-granularity probabilities corresponding to the voice signal; and processing the multi-granularity probabilities with an objective function to obtain the multi-granularity posterior probabilities.
2. The method of claim 1, wherein before linearly projecting the hidden vector sequence corresponding to the voice signal to obtain the multi-granularity posterior probability of the voice signal, the method further comprises: when the voice signal is obtained, framing the voice signal to obtain a multi-frame voice signal; extracting the frequency domain features corresponding to each frame of the voice signal through Fourier transformation, and calculating the frequency spectrum corresponding to the voice signal according to a first formula [formula not reproduced in the source text], wherein the first formula is expressed in terms of the frequency domain features corresponding to the T-th frame of the voice signal, and T is the number of frames of the multi-frame voice signal; and inputting the frequency spectrum into a target encoder to obtain the hidden vector sequence corresponding to the voice signal.
3. The method of claim 1, wherein before searching the multi-level mapping table for the word set and the character set associated with the sub-words in the voice signal, the method further comprises: converting each corpus entry in a preset corpus into a first character sequence, a first sub-word sequence and a first word sequence respectively; determining a first frequency of the first characters contained in each of the plurality of first character sequences, a second frequency of the first sub-words contained in each of the plurality of first sub-word sequences, and a third frequency of the first words contained in each of the plurality of first word sequences; selecting, from the plurality of first characters, second characters whose first frequency is greater than a first threshold, selecting, from the plurality of first sub-words, second sub-words whose second frequency is greater than a second threshold, and selecting, from the plurality of first words, second words whose third frequency is greater than a third threshold; and constructing the multi-level mapping table according to the second characters, the second sub-words and the second words.
4. The method of claim 3, wherein constructing the multi-level mapping table from the second character, the second subword, and the second word comprises: Determining a third sub-word to which each second character belongs from a plurality of second sub-words, and determining the third word to which each second character belongs from a plurality of second words; constructing a triplet according to each second character, the third sub-word and the third word; Constructing a target mapping matrix according to a plurality of triples, wherein the target mapping matrix at least comprises one of a first mapping matrix, a second mapping matrix and a third mapping matrix, wherein the first mapping matrix is used for indicating the mapping relation between sub words and characters, the second mapping matrix is used for indicating the mapping relation between the sub words and the words, and the third mapping matrix is used for indicating the mapping relation between the characters and the words; And combining the target mapping matrixes to obtain the multi-level mapping table.
5. The method of claim 1, wherein calculating a target path score for a subword in the speech signal based on the multi-granularity posterior probability, the word set, and the character set comprises: calculating the original path score of the sub word corresponding to the voice signal according to the sub word posterior probability; Calculating calibration scores of sub words corresponding to the voice signals according to the word posterior probability, the word set, the character posterior probability and the character set; and calculating a target path score of the subword corresponding to the voice signal according to the sum value of the original path score and the calibration score.
6. The method for determining word sequences in semantic recognition according to claim 5, wherein calculating the original path score of the sub-word corresponding to the voice signal according to the sub-word posterior probability comprises: constructing an initial Viterbi matrix, wherein the first row of the initial Viterbi matrix indicates the blank symbol, the other rows indicate the sub-words corresponding to each frame of the voice signal, and each column indicates one frame of the voice signal, the voice signal comprising a multi-frame voice signal; populating the first column of the initial Viterbi matrix based on a first initial score and a second initial score; determining a first score of each sub-word in the (m-1)-th column of the initial Viterbi matrix, and determining the sub-word posterior probability of the sub-word corresponding to the m-th frame of the voice signal, wherein m takes the values 2, 3, ... in sequence; determining a second score for each row in the m-th column according to the first score and the sub-word posterior probability of the sub-word corresponding to the m-th frame of the voice signal; and calculating the original path score of the sub-word corresponding to the voice signal according to the second score and the initial Viterbi matrix.
7. The method according to claim 5, wherein calculating the calibration score of the sub-word corresponding to the voice signal according to the word posterior probability, the word set, the character posterior probability and the character set comprises: calculating, according to the word posterior probability, a first probability corresponding to a fourth word in the word set that is associated with the sub-word corresponding to each frame of the voice signal, and calculating, according to the character posterior probability, a second probability corresponding to a fourth character in the character set that is associated with the sub-word corresponding to each frame of the voice signal, wherein the voice signal comprises a multi-frame voice signal; and calculating the calibration score of the sub-word corresponding to each frame of the voice signal according to the first probability and the second probability.
8. The method for determining word sequences in semantic recognition according to claim 7, wherein calculating the calibration score of the sub-word corresponding to each frame of the voice signal according to the first probability and the second probability comprises: calculating the calibration score of the sub-word corresponding to each frame of the voice signal according to a second formula [formula and symbols not reproduced in the source text], wherein the second formula is expressed in terms of: the calibration score of the sub-word corresponding to the t-th frame of the voice signal; the first probability of a fourth word w in the word set that is associated with the sub-word corresponding to the t-th frame of the voice signal; the second probability of a fourth character c in the character set that is associated with the sub-word corresponding to the t-th frame of the voice signal; the word set associated with the sub-word corresponding to the t-th frame of the voice signal; the character set associated with the sub-word corresponding to the t-th frame of the voice signal; and a weight coefficient.
9. The method of claim 1, wherein inputting the target path score into a target decoder to obtain a target sub-word sequence comprises: determining a first sub-word corresponding to the n-th frame of the voice signal, and inputting the target path score of the first sub-word corresponding to the n-th frame and the target path score of the sub-word corresponding to the (n-1)-th frame of the voice signal into the target decoder, so that the target decoder outputs a second sub-word corresponding to the (n-1)-th frame of the voice signal, wherein n takes the values j, j-1, j-2, ... in sequence, and j and n are positive integers; and constructing the target sub-word sequence according to the first sub-word and the plurality of second sub-words.
10. The method of claim 9, wherein constructing the target sub word sequence from the first sub word and the plurality of second sub words, comprises: Determining a first frame number corresponding to the first sub word and a second frame number corresponding to each second sub word; and sequentially ordering the first sub word and the plurality of second sub words based on the size of a target frame sequence number to generate the target sub word sequence, wherein the target frame sequence number comprises the first frame sequence number and the plurality of second frame sequence numbers.
11. The method for determining word sequences in semantic recognition according to claim 1, wherein obtaining the target word sequence corresponding to the target sub-word sequence according to the multi-level mapping table comprises: determining whether the target sub-word sequence contains third sub-words that are adjacent in position and repeated, and determining whether the target sub-word sequence contains blank symbols; when the third sub-words and/or the blank symbols exist in the target sub-word sequence, de-duplicating the target sub-word sequence and/or deleting the blank symbols from the target sub-word sequence; and calculating, according to the multi-level mapping table, the word sequence corresponding to the target sub-word sequence after de-duplication and/or deletion of blank symbols, and determining that word sequence as the target word sequence.
12. The method for determining word sequences in semantic recognition according to claim 11, wherein calculating word sequences corresponding to target sub-word sequences after de-duplication processing and/or deletion of blank symbols according to the multi-level mapping table comprises: querying a fifth word associated with each sub-word contained in the target sub-word sequence after the duplication elimination and/or deletion of blank symbols in the multi-level mapping table; Determining whether the number of fifth words associated with each subword contained in the target subword sequence after the duplication elimination and/or deletion of blank symbols is a target number; If the number of the fifth words associated with the fourth sub word contained in the target sub word sequence after the duplication elimination and/or deletion of the blank symbol is not the target number, inquiring a target frequency corresponding to each fifth word associated with the fourth sub word in the multi-level mapping table, and determining target words with the target frequency being greater than a preset frequency threshold value in a plurality of fifth words associated with the fourth sub word; determining the fifth word as a target word corresponding to a fourth sub word when the number of the fifth word associated with the fourth sub word contained in the target sub word sequence after the duplication elimination and/or deletion of the blank symbol is determined to be the target number; And constructing word sequences corresponding to the target sub-word sequences after the duplication elimination and/or deletion of the blank symbols according to the target words.
13. A word sequence determination apparatus in semantic recognition, comprising: the computing module is used for carrying out linear projection on the hidden vector sequence corresponding to the voice signal so as to obtain multi-granularity posterior probability of the voice signal, wherein the multi-granularity posterior probability comprises character posterior probability, sub-word posterior probability and word posterior probability; the searching module is used for searching a word set and a character set which are associated with the sub-words in the voice signal in a multi-level mapping table, and calculating a target path score of the sub-words in the voice signal according to the multi-granularity posterior probability, the word set and the character set; the input module is used for inputting the target path score into a target decoder to obtain a target sub-word sequence, and obtaining a target word sequence corresponding to the target sub-word sequence according to the multi-level mapping table; The computing module is further configured to input the hidden vector sequence into a multi-path linear structure to instruct the multi-path linear structure to perform linear projection on the hidden vector sequence to obtain multi-granularity probability corresponding to the voice signal, and perform function processing on the multi-granularity probability through an objective function to obtain the multi-granularity posterior probability.
14. A computer readable storage medium, characterized in that the computer readable storage medium comprises a stored program, wherein the program when run performs the method of any one of claims 1 to 12.
15. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method according to any of the claims 1 to 12 by means of the computer program.
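
The scoring and decoding steps recited in claims 5 to 12 can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the application: the toy mapping table, the word frequencies, the posterior values, the absence of a transition model in the Viterbi pass, the choice of the most frequent word when a sub-word maps to several words, and in particular the form of the calibration score, which is assumed here to be a weight `lam` times the sum of the associated word and character posteriors (the second formula itself is not reproduced in the text).

```python
import math
from typing import Dict, List

BLANK = "<blank>"
Posterior = List[Dict[str, float]]  # one {unit: probability} dict per frame

# Hypothetical multi-level mapping table; all entries are illustrative.
SUBWORD_TO_WORDS = {"open": ["open", "opening"], "light": ["light"], "lite": ["lite"]}
SUBWORD_TO_CHARS = {"open": list("open"), "light": list("light"), "lite": list("lite")}
WORD_FREQUENCY = {"open": 90, "opening": 10, "light": 50, "lite": 5}


def calibration(sub: str, t: int, word_post: Posterior, char_post: Posterior,
                lam: float) -> float:
    """ASSUMED form: lam * (sum of associated word posteriors + char posteriors)."""
    words = sum(word_post[t].get(w, 0.0) for w in SUBWORD_TO_WORDS.get(sub, []))
    chars = sum(char_post[t].get(c, 0.0) for c in SUBWORD_TO_CHARS.get(sub, []))
    return lam * (words + chars)


def decode(sub_post: Posterior, word_post: Posterior, char_post: Posterior,
           lam: float = 0.5) -> List[str]:
    """Viterbi pass over target scores, then backtrace from the last frame."""
    subwords = list(sub_post[0])

    def target(t: int, s: str) -> float:
        # target score = original score (log sub-word posterior) + calibration score
        return math.log(sub_post[t][s]) + calibration(s, t, word_post, char_post, lam)

    # One row per sub-word (including blank), one column per frame.
    score = [{s: target(0, s) for s in subwords}]
    back: List[Dict[str, str]] = [{}]
    for t in range(1, len(sub_post)):
        # No transition model is assumed, so every row shares the same best predecessor.
        best_prev = max(subwords, key=lambda s: score[t - 1][s])
        score.append({s: score[t - 1][best_prev] + target(t, s) for s in subwords})
        back.append({s: best_prev for s in subwords})

    current = max(subwords, key=lambda s: score[-1][s])
    path = [current]
    for t in range(len(sub_post) - 1, 0, -1):
        current = back[t][current]
        path.append(current)
    path.reverse()
    return path


def collapse(path: List[str]) -> List[str]:
    """Merge adjacent repeats, then drop blank symbols."""
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return out


def to_words(subwords: List[str]) -> List[str]:
    """Simplification: with several candidate words, keep the most frequent one."""
    return [max(SUBWORD_TO_WORDS[s], key=lambda w: WORD_FREQUENCY[w]) for s in subwords]


# Toy three-frame input: "lite" narrowly beats "light" acoustically in the last frame.
SUB_POST = [
    {BLANK: 0.10, "open": 0.80, "lite": 0.05, "light": 0.05},
    {BLANK: 0.70, "open": 0.10, "lite": 0.10, "light": 0.10},
    {BLANK: 0.10, "open": 0.05, "lite": 0.45, "light": 0.40},
]
WORD_POST = [
    {"open": 0.6, "opening": 0.2, "light": 0.1, "lite": 0.1},
    {"open": 0.25, "opening": 0.25, "light": 0.25, "lite": 0.25},
    {"light": 0.8, "lite": 0.2},
]
CHAR_POST = [
    {"o": 0.3, "p": 0.3, "e": 0.2, "n": 0.1, "l": 0.1},
    {"o": 0.2, "e": 0.2, "l": 0.2, "i": 0.2, "t": 0.2},
    {"l": 0.2, "i": 0.2, "g": 0.2, "h": 0.15, "t": 0.2, "e": 0.05},
]

calibrated = to_words(collapse(decode(SUB_POST, WORD_POST, CHAR_POST, lam=0.5)))
uncalibrated = to_words(collapse(decode(SUB_POST, WORD_POST, CHAR_POST, lam=0.0)))
print(calibrated, uncalibrated)
```

On this toy input the calibrated decode selects `light`, whereas setting `lam` to 0 (sub-word posteriors only) selects the homophone `lite`, which is the kind of correction the calibration score is meant to provide.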

Description

Word sequence determining method and device in semantic recognition, storage medium and electronic device

Technical Field

The application relates to the technical field of smart homes, and in particular to a method and device for determining word sequences in semantic recognition, a storage medium, and an electronic device.

Background

In open scenarios, end-to-end automatic speech recognition (ASR) commonly uses an encoder-decoder framework: the encoder outputs acoustic vectors, and the decoder generates text directly through Connectionist Temporal Classification (CTC) or an attention mechanism. Because of the conditional independence assumption of CTC and the exposure bias of attention-based decoding, the posterior distribution in regions containing rare words and polyphonic characters is easily swamped by high-frequency homophones. As a result, rare words, polyphonic characters, and technical terms are often misrecognized as homophones that sound similar but differ in meaning ("near-sound, wrong-meaning" errors). Rescoring with an external language model introduces additional latency and cannot run in real time on mobile devices.

No effective solution has yet been proposed for this homophone misrecognition of rare words, polyphonic characters, technical terms, and the like in end-to-end ASR in the related art.
Disclosure of Invention

Embodiments of the application provide a method and device for determining word sequences in semantic recognition, a storage medium, and an electronic device, which at least solve the problem in the related art that end-to-end ASR misrecognizes rare words, polyphonic characters, technical terms, and the like as homophones that sound similar but differ in meaning.

According to one embodiment of the application, a method for determining word sequences in semantic recognition is provided, comprising: linearly projecting a hidden vector sequence corresponding to a voice signal to obtain multi-granularity posterior probabilities of the voice signal, wherein the multi-granularity posterior probabilities comprise a character posterior probability, a sub-word posterior probability, and a word posterior probability; searching a multi-level mapping table for a word set and a character set associated with the sub-words in the voice signal; calculating target path scores of the sub-words in the voice signal according to the multi-granularity posterior probabilities, the word set, and the character set; inputting the target path scores into a target decoder to obtain a target sub-word sequence; and obtaining a target word sequence corresponding to the target sub-word sequence according to the multi-level mapping table.

In an exemplary embodiment, linearly projecting the hidden vector sequence corresponding to the voice signal to obtain the multi-granularity posterior probabilities of the voice signal comprises: inputting the hidden vector sequence into a multi-path linear structure so that the multi-path linear structure linearly projects the hidden vector sequence to obtain the multi-granularity probabilities corresponding to the voice signal; and processing the multi-granularity probabilities with an objective function to obtain the multi-granularity posterior probabilities.
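
The multi-path projection described in the exemplary embodiment above can be pictured with a minimal Python sketch. It assumes that the multi-path linear structure is three independent linear heads (character, sub-word, word) applied to each hidden vector and that the objective function is a softmax; neither choice is stated in the text, and the weights, dimensions, and function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]  # one row of weights per output unit


def linear(h: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Project one hidden vector: logits[i] = weight[i] . h + bias[i]."""
    return [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(weight, bias)]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax (ASSUMED choice of objective function)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multi_granularity_posteriors(
    hidden: List[Vector], heads: Dict[str, Tuple[Matrix, Vector]]
) -> Dict[str, List[Vector]]:
    """Apply every head to every frame; returns granularity -> per-frame posteriors."""
    return {
        name: [softmax(linear(h, weight, bias)) for h in hidden]
        for name, (weight, bias) in heads.items()
    }


# Toy input: two frames of 2-dimensional hidden vectors, three hypothetical heads.
hidden = [[1.0, 0.0], [0.0, 1.0]]
heads = {
    "char": ([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], [0.0, 0.0, 0.0]),
    "subword": ([[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0]),
    "word": ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
}
posteriors = multi_granularity_posteriors(hidden, heads)
```

Because each head has its own output vocabulary, the three posterior distributions are normalised separately for every frame, which is what allows the later scoring step to combine them per sub-word.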
In one exemplary embodiment, before linearly projecting the hidden vector sequence corresponding to the voice signal to obtain the multi-granularity posterior probability of the voice signal, the method further comprises: when the voice signal is obtained, framing the voice signal to obtain a multi-frame voice signal; extracting the frequency domain features corresponding to each frame of the voice signal through Fourier transformation, and calculating the frequency spectrum corresponding to the voice signal according to a first formula [formula not reproduced in the source text], wherein the first formula is expressed in terms of the frequency domain features corresponding to the T-th frame of the voice signal, and T is the number of frames of the multi-frame voice signal; and inputting the frequency spectrum into a target encoder to obtain the hidden vector sequence corresponding to the voice signal.

In an exemplary embodiment, before searching the multi-level mapping table for the word set and the character set associated with the sub-words in the voice signal, the method further comprises: converting each corpus entry in a preset corpus into a first character sequence, a first sub-word sequence and a first word sequence; determining first frequencies of the first characters contained in each of the first character sequences, second frequencies of the first sub-words contained in each first sub-word sequence in the first