
US-20260128039-A1 - ENABLING CUSTOM WORD IDENTIFICATION IN SPEECH-TO-TEXT MODELS


Abstract

Language models require resource-intensive training. As a result, retraining language models happens very infrequently even though new or custom words are needed at a much faster rate. By deploying a custom set of words, speech recognition may be performed using a previously trained language model augmented with entries in a custom word list. A probability map is created for each token position predicted by the model. Next, potential positions for custom words are identified by calculating the probability ratio between the token selected by the model and the custom word token. The probability ratios are summed for the first, second, and last tokens of the custom words, and if the sum falls below a certain threshold, the position is recorded. Next, the word or words starting at the recorded position are identified and, using string comparison metrics, are determined to be the most likely candidates for replacement.
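As a rough illustration of the probability map mentioned above (not part of the published abstract), the sketch below converts a decoder's per-position logits into one probability distribution per predicted token position. The function name, array shapes, and use of NumPy are assumptions made for the example.

```python
# Hypothetical sketch: building a per-position probability map from a
# speech-to-text decoder's logits. Assumes logits has shape
# (num_positions, vocab_size); the actual model and its API are not specified here.
import numpy as np

def build_probability_map(logits: np.ndarray) -> list[dict[int, float]]:
    """Return one {token_id: probability} map per decoded token position."""
    # Numerically stable softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=-1, keepdims=True)
    return [{tok: float(p) for tok, p in enumerate(row)} for row in probs]
```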

Inventors

  • SEAN MARK BLANCHFLOWER
  • Wenting Zhang

Assignees

  • MICRO FOCUS LLC

Dates

Publication Date
2026-05-07
Application Date
2024-11-06

Claims (20)

  1. A method, comprising: accessing speech to be recognized; providing the speech to a previously trained language model and receiving a first set of tokens therefrom; providing the speech to a custom language model and receiving a second set of tokens therefrom; determining a position within the speech where a token of the second set of tokens is a better fit than a token of the first set of tokens; and replacing a default word of the speech, determined by the previously trained language model, with a custom word at the position.
  2. The method of claim 1, wherein: determining the position within the speech where the token of the second set of tokens is the better fit than the token of the first set of tokens comprises determining the position within the speech where a first ratio is greater than a second ratio; the first ratio is determined by summing probability ratios of less than all tokens of the custom word; and determining the position in the speech comprises determining where in the speech the first ratio is below a previously determined threshold.
  3. The method of claim 2, wherein summing the probability ratios of less than all tokens of the custom word comprises summing the probability ratios of a first token, a second token, and a last token of the custom word.
  4. The method of claim 1, wherein replacing the default word of the speech determined by the previously trained language model with the custom word at the position further comprises using string comparison metrics to select a best match of a set of custom words, comprising the custom word, to a word at the position.
  5. The method of claim 1, wherein replacing the word with the custom word at the position further comprises providing the custom word as a portion of a transcription.
  6. The method of claim 1, wherein replacing the default word with the custom word at the position further comprises providing the custom word as a portion of a command to a computing device.
  7. The method of claim 1, wherein the previously trained language model comprises at least one of a large language model or a neural network trained to recognize a generic set of words.
  8. The method of claim 1, wherein at least one of the default word and the custom word comprise a plurality of words.
  9. A system, comprising: a computing device comprising one or more processors coupled to a computer memory comprising instructions; and wherein the instructions, when read by the one or more processors, cause the one or more processors to perform: accessing speech to be recognized; providing the speech to a previously trained language model and receiving a first set of tokens therefrom; providing the speech to a custom language model and receiving a second set of tokens therefrom; determining a position within the speech where a token of the second set of tokens is a better fit than a token of the first set of tokens; and replacing a default word of the speech, determined by the previously trained language model, with a custom word at the position.
  10. The system of claim 9, wherein: determining the position within the speech where the token of the second set of tokens is the better fit than the token of the first set of tokens comprises determining the position within the speech where a first ratio is greater than a second ratio; the first ratio is determined by summing probability ratios of less than all tokens of the custom word; and determining the position in the speech comprises determining where in the speech the first ratio is below a previously determined threshold.
  11. The system of claim 10, wherein summing the probability ratios of less than all tokens of the custom word comprises summing the probability ratios of a first token, a second token, and a last token of the custom word.
  12. The system of claim 9, wherein replacing the default word of the speech determined by the previously trained language model with the custom word at the position further comprises using string comparison metrics to select a best match of a set of custom words, comprising the custom word, to a word at the position.
  13. The system of claim 9, wherein replacing the default word with the custom word at the position further comprises providing the custom word as a portion of a transcription.
  14. The system of claim 9, wherein replacing the default word with the custom word at the position further comprises providing the custom word as a portion of a command to a computing device.
  15. The system of claim 9, wherein the previously trained language model comprises at least one of a large language model or a neural network trained to recognize a generic set of words.
  16. The system of claim 9, wherein at least one of the default word and the custom word comprise a plurality of words.
  17. A non-transitory computer readable medium comprising instructions that, when read by a machine, cause the machine to perform: accessing speech to be recognized; providing the speech to a previously trained language model and receiving a first set of tokens therefrom; providing the speech to a custom language model and receiving a second set of tokens therefrom; determining a position within the speech where a token of the second set of tokens is a better fit than a token of the first set of tokens; and replacing a default word of the speech, determined by the previously trained language model, with a custom word at the position.
  18. The non-transitory computer readable medium of claim 17, further comprising instructions to cause the machine to perform: determining the position within the speech where the token of the second set of tokens is the better fit than the token of the first set of tokens comprising determining the position within the speech where a first ratio is greater than a second ratio; the first ratio is determined by summing probability ratios of less than all tokens of a custom word; and determining the position in the speech comprises determining where in the speech the first ratio is below a previously determined threshold.
  19. The non-transitory computer readable medium of claim 18, further comprising instructions to cause the machine to perform summing the probability ratios of less than all tokens of the custom word comprising summing the probability ratios of a first token, a second token, and a last token of the custom word.
  20. The non-transitory computer readable medium of claim 17, further comprising instructions to cause the machine to perform replacing the default word of the speech determined by the previously trained language model with the custom word at the position, further comprising using string comparison metrics to select a best match of a set of custom words, comprising the custom word, to a word at the position.

Description

FIELD OF THE DISCLOSURE

The invention relates generally to systems and methods for automated speech recognition and particularly to supplementing a previously trained language model with custom words without retraining the model.

BACKGROUND

Understanding human speech is complex and nuanced, whether for humans or machines. Humans may have accents, different speeds of talking, and other differences that complicate understanding of spoken language. Humans may say the same words or sounds with different meanings, meanings of words may change based on inflection, and other nuanced speech patterns may be present without any intended change in meaning. Humans are well adapted to recognize the context of speech, although misunderstandings still occur. When speech is heard and believed to be accurately understood, the context of the speech may provide a different meaning or, if nothing else, indicate a lack of certainty to the listener.

Prior art speech recognition techniques have advanced from user-specific training to become adaptable to a broader range of individuals and their particular speaking patterns. Large language models (LLMs) are commonly used to enable computer systems to perform speech recognition. However, training such a model represents a very significant investment in computing hardware and in operating that hardware; even the electricity required to train an LLM can be a substantial cost. As a result, once an LLM is trained, it will likely not be retrained for some time. Human speech is always advancing, and the need to add new words to an LLM begins almost as soon as training ends. As a result, automated speech recognition systems are often out of date and fail to recognize words other than those used in the training set.

SUMMARY

Modern speech recognition systems, in particular LLM-based systems, are well adapted to handle speech consisting of words known at the time the LLM was trained. New words are commonly introduced in the form of product names and the names of individuals, such as government or business officials with obscure names. As a result, computer-recognized speech often requires extensive manual correction or may be unusable due to an unacceptable level of errors, especially if the error affects a word or phrase that is key to a discussion. These and other needs are addressed by the various embodiments and configurations of the present invention. The present invention can provide a number of advantages depending on the particular configuration. These and other advantages will be apparent from the disclosure of the invention(s) contained herein.

In one embodiment, an extensible system for allowing speech-to-text of custom words and phrases is disclosed. The custom words may include, but are not limited to, people's names, locations, or company and product names that are not present in the training data. In another embodiment, an existing trained model can be augmented at run-time to look for custom words. First, a probability map is created for each token (usually a word) position predicted by the model. Next, potential positions for custom words are identified by calculating the probability ratio between the token selected by the model and the custom word token. The probability ratios for the first, second, and last tokens of the custom words are summed and, if the sum falls below a certain threshold, the position is recorded.
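The candidate-position step just described can be sketched as follows. This is a minimal illustration, assuming raw (rather than log) probabilities and that each ratio compares the probability of the model's selected token to that of the corresponding custom-word token; the direction of the ratio and all names below are assumptions, not taken from this disclosure.

```python
# Hedged sketch of candidate-position detection: sum the probability ratios of a
# custom word's first, second, and last tokens at each decoded position and record
# the position when the sum falls below a threshold. Ratio direction is assumed.

def find_candidate_positions(prob_maps, selected_tokens, custom_tokens, threshold):
    """prob_maps: list of {token_id: probability}, one per decoded position.
    selected_tokens: token ids chosen by the previously trained model.
    custom_tokens: token ids of one custom word (at least one token).
    """
    n = len(custom_tokens)
    # Only the first, second, and last custom-word tokens contribute to the sum.
    checked = sorted({0, min(1, n - 1), n - 1})
    positions = []
    for pos in range(len(selected_tokens) - n + 1):
        total = 0.0
        for off in checked:
            probs = prob_maps[pos + off]
            p_selected = probs.get(selected_tokens[pos + off], 1e-12)
            p_custom = probs.get(custom_tokens[off], 1e-12)
            total += p_selected / p_custom
        if total < threshold:
            positions.append(pos)
    return positions
```

Under this reading, a small sum means the custom word's tokens are nearly as probable as the tokens the base model chose, so the position is worth a closer look.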
Next, the word or words starting at the recorded position are identified and, using string comparison metrics, determined (such as by a processor) to be the most likely candidates for replacement. As a benefit, the metrics help to ensure that the word being replaced closely matches the custom word, reducing the risk of false positives.

In some aspects, the techniques described herein relate to a method, including: accessing speech to be recognized; providing the speech to a previously trained language model and receiving a first set of tokens therefrom; providing the speech to a custom language model and receiving a second set of tokens therefrom; determining a position within the speech where a token of the second set of tokens is a better fit than a token of the first set of tokens; and replacing a default word of the speech, determined by the previously trained language model, with a custom word at the position.

In some aspects, the techniques described herein relate to a method, wherein: determining the position within the speech where the token of the second set of tokens is the better fit than the token of the first set of tokens includes determining the position within the speech where a first ratio is greater than a second ratio; the first ratio is determined by summing probability ratios of less than all tokens of the custom word; and determining the position in the speech includes determining where in the speech the first ratio is below a previously determined threshold.

In some aspects, the techniques described herein relate to a method, wherein summing the probability ratios of less than all tokens of the custom word includes summing the probability ratios of a first token, a second token, and a last token of the custom word.
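The string-comparison step described above might look like the following sketch; the choice of Python's difflib similarity ratio and the 0.8 cutoff are illustrative assumptions rather than anything specified by this disclosure.

```python
# Hypothetical sketch: pick the custom word that best matches the decoded word(s)
# at a recorded position, and replace only if the match is close enough.
from difflib import SequenceMatcher

def best_custom_replacement(decoded_words, custom_words, min_similarity=0.8):
    """decoded_words: word(s) produced by the base model at a candidate position.
    custom_words: the user-supplied custom word list.
    Returns the closest custom word, or None if no candidate is similar enough."""
    target = " ".join(decoded_words).lower()
    best, best_score = None, 0.0
    for candidate in custom_words:
        score = SequenceMatcher(None, target, candidate.lower()).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= min_similarity else None
```

Requiring both the probability test and a close textual match is what the disclosure credits with keeping false positives low.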