US-20260128036-A1 - METHODS FOR NATURAL LANGUAGE MODEL TRAINING IN NATURAL LANGUAGE UNDERSTANDING (NLU) SYSTEMS
Abstract
Systems and methods for determining whether to perform an action of a query using a trained natural language model of a natural language understanding (NLU) system are disclosed herein. A text string corresponding to a prescribed action and including at least a content entity is received. A determination is made as to whether the text string corresponds to an audio input of a first group. In response to determining that the text string corresponds to an audio input of the first group, a determination is made as to whether the text string includes an obsequious expression. In response to determining that the text string corresponds to an audio input of the first group and that the text string includes the obsequious expression, a determination is made to perform the prescribed action. In response to determining that the text string corresponds to an audio input of the first group and that the text string does not include the obsequious expression, a determination is made to not perform the prescribed action.
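The decision logic summarized in the abstract can be sketched in a few lines. This is an illustrative sketch only; the phrase list, the `is_first_group` flag, and the fallback behavior for inputs outside the first group are assumptions, not taken from the specification.

```python
# Illustrative phrase list; the specification does not enumerate the
# obsequious expressions a deployed system would recognize.
OBSEQUIOUS_EXPRESSIONS = ("please", "thank you", "kindly")


def contains_obsequious_expression(text: str) -> bool:
    """Check whether the text string includes an obsequious expression."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in OBSEQUIOUS_EXPRESSIONS)


def should_perform_action(text: str, is_first_group: bool) -> bool:
    """Perform the prescribed action for first-group input only when the
    text string also contains an obsequious expression."""
    if not is_first_group:
        # Assumption: input outside the first group follows the normal
        # pipeline; the abstract only defines first-group behavior.
        return True
    return contains_obsequious_expression(text)
```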
Inventors
- Jeffry Copps Robert Jose
- Mithun Umesh
Assignees
- ADEIA GUIDES INC.
Dates
- Publication Date
- 20260507
- Application Date
- 20251002
Claims (20)
- 1 . (canceled)
- 2 . A computer-implemented method, comprising: receiving a voice query; determining that the voice query comprises audio input of a user belonging to a grouping classification; determining the voice query comprises an obsequious portion comprising an obsequious expression; identifying a non-obsequious portion of the voice query, wherein the obsequious portion of the voice query does not describe the non-obsequious portion of the voice query; and based at least in part on determining that the received voice query comprises the audio input of the user that belongs to the grouping classification and that the received voice query comprises the obsequious portion, causing output of data in relation to the voice query.
- 3 . The method of claim 2 , wherein the grouping classification corresponds to users categorized as children, and wherein determining the voice query comprises the audio input from the user belonging to the grouping classification is based at least in part on analyzing one or more audio characteristics of the audio input to determine that the user is a child.
- 4 . The method of claim 3 , wherein determining the voice query comprises the obsequious expression is performed based at least in part on determining that the voice query comprises the audio input of the child.
- 5 . The method of claim 2 , wherein: the voice query comprises a request to perform an action; the method further comprises: determining that the non-obsequious portion comprises an indication of the action and an indication of a content item; and causing output of the data comprises causing performance of the action indicated in the non-obsequious portion of the voice query, wherein the action relates to the content item.
- 6 . The method of claim 2 , wherein causing output of the data comprises causing output of a voice reply to the voice query.
- 7 . The method of claim 2 , wherein the voice query is received at a first time, and the method further comprises: receiving a subsequent voice query at a second time; determining the subsequent voice query does not comprise a subsequent obsequious expression, wherein determining the subsequent voice query does not comprise the subsequent obsequious expression is performed based at least in part on determining that the subsequent voice query comprises the audio input of the user belonging to the grouping classification; and causing output of a reply to the subsequent voice query.
- 8 . The method of claim 7 , wherein, based at least in part on determining the subsequent voice query comprises the audio input from the user that is associated with the grouping classification and that the subsequent voice query does not comprise the subsequent obsequious expression, the method further comprises causing the reply to the subsequent voice query to comprise an instructional message to solicit a modified query that includes one or more obsequious expressions.
- 9 . The method of claim 8 , further comprising: receiving the modified query with the one or more obsequious expressions; and performing an action indicated in the modified query with the one or more obsequious expressions.
- 10 . The method of claim 9 , further comprising: determining whether the modified query with the one or more obsequious expressions was received within a predetermined time period from when the instructional message was output; and based at least in part on determining the modified query with the one or more obsequious expressions was received within the predetermined time period from when the instructional message was output, performing the action indicated in the modified query.
- 11 . The method of claim 5 , wherein, in addition to causing the performance of the action requested by the voice query based at least in part on determining the voice query corresponds to the audio input from the user that is associated with the grouping classification and that the voice query does comprise the obsequious expression, the method further comprises causing output of a reply to the voice query comprising the obsequious expression.
- 12 . The method of claim 2 , wherein the non-obsequious portion of the voice query comprises a reference to a content item.
- 13 . The method of claim 2 , wherein the obsequious portion of the voice query does not describe the non-obsequious portion of the voice query based at least in part on the obsequious portion of the voice query comprising one or more intentional obsequious expressions.
- 14 . The method of claim 13 , further comprising determining that the obsequious portion of the voice query comprises the one or more intentional obsequious expressions by: identifying a text string corresponding to the voice query; determining a context of the obsequious portion within the text string; and determining that the obsequious portion comprises the one or more intentional obsequious expressions based at least in part on the context of the obsequious portion of the voice query within the text string corresponding to the voice query.
- 15 . A computer-implemented system, comprising: control circuitry; and input/output circuitry configured to: receive a voice query; wherein the control circuitry is configured to: determine that the voice query comprises audio input of a user belonging to a grouping classification; determine the voice query comprises an obsequious portion comprising an obsequious expression; identify a non-obsequious portion of the voice query, wherein the obsequious portion of the voice query does not describe the non-obsequious portion of the voice query; and based at least in part on determining that the received voice query comprises the audio input of the user that belongs to the grouping classification and that the received voice query comprises the obsequious portion, cause output of data in relation to the voice query.
- 16 . The system of claim 15 , wherein the grouping classification corresponds to users categorized as children, and wherein determining the voice query comprises the audio input from the user belonging to the grouping classification is based at least in part on the control circuitry further configured to analyze one or more audio characteristics of the audio input to determine that the user is a child.
- 17 . The system of claim 16 , wherein determining the voice query comprises the obsequious expression is performed based at least in part on the control circuitry further configured to determine that the voice query comprises the audio input of the child.
- 18 . The system of claim 15 , wherein: the voice query comprises a request to perform an action; and the control circuitry is further configured to: determine that the non-obsequious portion comprises an indication of the action and an indication of a content item; and cause output of the data by causing performance of the action indicated in the non-obsequious portion of the voice query, wherein the action relates to the content item.
- 19 . The system of claim 15 , wherein causing output of the data comprises causing output of a voice reply to the voice query.
- 20 . The system of claim 15 , wherein the voice query is received at a first time, and the control circuitry is further configured to: receive a subsequent voice query at a second time; determine the subsequent voice query does not comprise a subsequent obsequious expression, wherein determining the subsequent voice query does not comprise the subsequent obsequious expression is performed based at least in part on determining that the subsequent voice query comprises the audio input of the user belonging to the grouping classification; and cause output of a reply to the subsequent voice query.
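Claims 3 and 16 condition the grouping classification on one or more audio characteristics of the input. One crude, illustrative heuristic (not the trained classifier a production system would use) is fundamental-frequency estimation, since children's voices typically have a higher pitch than adults'; the autocorrelation method and the 250 Hz threshold below are assumptions for illustration.

```python
import numpy as np


def estimate_f0(samples: np.ndarray, sample_rate: int) -> float:
    """Crude fundamental-frequency estimate via autocorrelation."""
    samples = samples - samples.mean()
    n = len(samples)
    # Autocorrelation at non-negative lags; index equals lag in samples.
    corr = np.correlate(samples, samples, mode="full")[n - 1:]
    # Search lags corresponding to a plausible speech F0 range (75-600 Hz).
    lo, hi = sample_rate // 600, sample_rate // 75
    lag = lo + int(np.argmax(corr[lo:hi]))
    return sample_rate / lag


def is_child_voice(samples: np.ndarray, sample_rate: int,
                   threshold_hz: float = 250.0) -> bool:
    # Assumption: child speech F0 commonly exceeds ~250 Hz; a real system
    # would combine many audio characteristics in a trained model.
    return estimate_f0(samples, sample_rate) > threshold_hz
```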
Description
CROSS REFERENCE TO RELATED APPLICATIONS
This application is a continuation of U.S. patent application Ser. No. 18/746,439, filed Jun. 18, 2024, which is a continuation of U.S. patent application Ser. No. 18/113,984, filed Feb. 24, 2023, now U.S. Pat. No. 12,046,230, which is a continuation of U.S. patent application Ser. No. 16/805,342, filed Feb. 28, 2020, now U.S. Pat. No. 11,626,103, the disclosures of which are incorporated by reference herein in their entireties.
BACKGROUND
The present disclosure relates to natural language model training systems and methods and, more particularly, to systems and methods for training and employing natural language models in natural language understanding (NLU) system operations.
SUMMARY
Voice-controlled human-machine interfaces have undoubtedly gained popularity among avid electronic device users. Learning to recognize and process speech, however, is no easy feat for these interface devices. Large data sets serve as training input to speech recognition models to build reliable speech recognition capability over time, oftentimes a long time. Generally, the larger the training data set and the longer the training, the more reliable the recognized speech. Text string recognition capability shares similar reliability characteristics. Voice and text string recognition technology for certain applications remains in its infancy, with improvements yet to be realized.
Regardless of training size or duration, speech and text recognition suffer from inaccuracies when provided with inputs of inadequate clarity and volume. A soft-spoken voice often falls victim to misinterpretation, or no interpretation at all, by a device having voice interface capabilities. Take the case of a 6-year-old child, for example. Speaking to a device located 10 or 20 feet away, the 6-year-old is unlikely to speak with the requisite voice strength and speech clarity for proper speech or text recognition functionality.
Unless a command is spoken with clarity and, in particular, strength of volume, a device using voice input cannot carry out the child's commands, for example. Children are thus driven to speak louder to properly convey their wishes, an outcome that is not without consequence. Habits generally take form at an early age, and current voice-recognition technology, albeit unintentionally, is teaching kids to behave rudely and obnoxiously by loudly voicing a command.
Voice-recognition technology manufacturers have attempted to address the foregoing issue by requiring devices with voice interfaces to conform to polite speech, for example, a “thank you” or “please” preceding or following a command, such as “change channels” or “play Barney”. In some cases, the device will simply refuse to carry out the command in the absence of detecting an obsequious expression. Amazon's Echo device, Amazon Fire TV, Amazon Fire Stick, Apple TV, Android mobile devices with Google's “Ok Google” application and the iPhone with Siri serve as examples of devices with voice interface functionality. Some devices go as far as responding to an impolite input query only to remind the user to repeat the command using polite words; not until a polite command follows will the device indeed carry out the command. In response to “play Barney”, for example, the device prevents the show Barney from playing until an alteration of the command is received using an obsequious expression, e.g., “play Barney, please”.
Such advancements are certainly notable, but other issues remain. Natural language voice recognition systems, such as natural language understanding (NLU) systems, require user utterance training for proper utterance matching in addition to user query recognition and interpretation functionalities.
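The prompt-and-retry behavior described above (and formalized in claims 7 through 10) can be sketched as a small stateful gatekeeper. The phrase list, the 30-second window standing in for the claims' "predetermined time period", and the reply strings are all assumptions for illustration.

```python
import time


class PoliteGatekeeper:
    """Illustrative sketch: refuse an impolite query, output an
    instructional message, and honor a polite retry within a window."""

    POLITE = ("please", "thank you")  # assumed phrase list
    WINDOW = 30.0  # assumed value of the "predetermined time period" (s)

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.prompt_time = None  # when the instructional message was output

    def handle(self, query: str) -> str:
        if any(p in query.lower() for p in self.POLITE):
            prompted_at, self.prompt_time = self.prompt_time, None
            if prompted_at is None or self.clock() - prompted_at <= self.WINDOW:
                return "PERFORM: " + query
            # Assumption: outside the window, the device prompts again.
            return "Too late, please ask again politely."
        # Impolite query: withhold the action and instruct the user.
        self.prompt_time = self.clock()
        return "Remember to say 'please' when you ask."
```

A fake clock makes the window behavior easy to exercise without sleeping.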
Adding an obsequious expression to a user query as a prefix or a suffix, such as “please” at the end of “play Game of Thrones”, presents challenges to voice-recognition model training. One such challenge is a reduction in match scores for previously trained speech (or queries): in the presence of an obsequious expression, the model fails to recognize an utterance with the same degree of accuracy as before. Consequently, additional costly and lengthy training techniques may be required. Further, system architecture is made unnecessarily complicated to accommodate additional natural language model training for text strings or speech that include obsequious expressions. Finally, removing obsequious expressions from search queries, while a seemingly viable solution, poses a problem for content search applications with entity titles that include such expressions, because removing the expressions from the query corrupts those titles. For example, the query for the movie title “Thank You for Smoking”, “Play Thank You for Smoking”, may be reduced to “Play <entity_title: you for smoking>”, which would yield incorrect results. Some of the examples presented in this disclosure are directed
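One way to avoid corrupting entity titles when stripping obsequious expressions is to protect any span of the query that matches a known catalog title, and strip polite phrases only outside that span. The catalog and phrase list below are hypothetical, for illustration only.

```python
# Hypothetical title catalog and phrase list; a real system would query
# its content metadata service instead.
KNOWN_TITLES = {"thank you for smoking", "game of thrones"}
OBSEQUIOUS = ("thank you", "please")


def strip_obsequious(query: str) -> str:
    """Remove polite phrases, but never from inside a known entity title."""
    q = query.lower()
    for title in KNOWN_TITLES:
        start = q.find(title)
        if start >= 0:
            # Protect the title span; strip polite words outside it only.
            before, after = q[:start], q[start + len(title):]
            for w in OBSEQUIOUS:
                before, after = before.replace(w, ""), after.replace(w, "")
            return " ".join((before + title + after).split())
    for w in OBSEQUIOUS:
        q = q.replace(w, "")
    return " ".join(q.split())
```

With this guard, “Play Thank You for Smoking” keeps its title intact while an appended “please” is still removed from an ordinary query.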