EP-4462323-B1 - EXAMPLE-BASED VOICE BOT DEVELOPMENT TECHNIQUES

Inventors

  • AHARONI, Asaf
  • LEVIATHAN, Yaniv
  • SEGALIS, Eyal
  • ELIDAN, Gal
  • GOLDSHTEIN, Sasha
  • AMIAZ, Tomer
  • COHEN, Deborah

Dates

Publication Date
2026-05-13
Application Date
2021-12-03

Claims (15)

  1. A method implemented by one or more processors, the method comprising: obtaining, via a voice bot development platform, a plurality of training instances, each of the plurality of training instances including: training instance input, the training instance input including at least a portion of a corresponding conversation and a prior context of the corresponding conversation, and training instance output, the training instance output including a corresponding ground truth response to at least the portion of the corresponding conversation; the method being characterised by further comprising: obtaining, via the voice bot development platform, a corresponding feature emphasis input associated with one or more of the plurality of training instances; training, via the voice bot development platform, a voice bot based on the plurality of training instances and the corresponding feature emphasis input associated with one or more of the plurality of training instances, wherein the corresponding feature emphasis input associated with one or more of the plurality of training instances attentions the voice bot to a particular feature of the portion of the corresponding conversation; and subsequent to training the voice bot: causing the trained voice bot to be deployed for conducting conversations on behalf of a third-party.
  2. The method of claim 1, wherein training the voice bot comprises: processing, using a plurality of machine learning (ML) layers of a ML model, and for a given training instance of the plurality of training instances, at least the portion of the corresponding conversation and the prior context of the corresponding conversation to generate an embedding associated with a current state of the corresponding conversation.
  3. The method of claim 2, wherein the portion of the corresponding conversation comprises a plurality of speech hypotheses for at least the portion of the corresponding conversation, and wherein processing at least the portion of the corresponding conversation and the prior context of the corresponding conversation to generate the embedding associated with the current state of the corresponding conversation comprises: processing, using first ML layers of the plurality of ML layers, the plurality of speech hypotheses to generate a first embedding, processing, using second ML layers of the plurality of ML layers, the prior context of the corresponding conversation to generate a second embedding, and concatenating the first embedding and the second embedding to generate the embedding associated with the current state of the corresponding conversation.
  4. The method of claim 3, further comprising: generating, via the voice bot development platform, a plurality of affinity features based on the embedding associated with the current state of the corresponding conversation.
  5. The method of claim 4, wherein training the voice bot further comprises: processing, using a plurality of additional ML layers of the ML model or an additional ML model, the plurality of affinity features and the embedding associated with the current state of the corresponding conversation to generate a predicted embedding associated with a predicted response to at least the portion of the corresponding conversation.
  6. The method of claim 5, wherein training the voice bot further comprises: comparing, in embedding space, the predicted embedding associated with the predicted response to at least the portion of the corresponding conversation and a corresponding ground truth embedding associated with the corresponding ground truth response to at least the portion of the corresponding conversation; generating, based on comparing the predicted embedding and the corresponding ground truth embedding, one or more losses; and updating the ML model based on one or more of the losses and the corresponding feature emphasis input associated with the given training instance.
  7. The method of claim 6, wherein the ML model is a transformer model that includes one or more attention mechanisms, and wherein updating the transformer model based on one or more of the losses and the corresponding feature emphasis input associated with the given training instance comprises: causing weights of one or more of the plurality of ML layers or the plurality of additional ML layers to be updated based on one or more of the losses; and causing the one or more of the attention mechanisms of the transformer model to be attentioned to one or more features of at least the portion of the corresponding conversation based on the corresponding feature emphasis input associated with the given training instance.
  8. The method of claim 3, wherein the portion of the corresponding conversation further comprises audio data corresponding to a spoken utterance that captures at least the portion of the corresponding conversation, and wherein the plurality of speech hypotheses are generated based on processing, using an automatic speech recognition (ASR) model, the audio data corresponding to the spoken utterance to generate the plurality of speech hypotheses for at least the portion of the corresponding conversation.
  9. The method of claim 8, further comprising: aligning one or more corresponding textual segments associated with each of the plurality of speech hypotheses; annotating each of the one or more corresponding textual segments with at least one corresponding label to generate a plurality of annotated speech hypotheses; and wherein processing the plurality of speech hypotheses to generate the first embedding using the first ML layers of the plurality of ML layers comprises: processing the plurality of annotated speech hypotheses to generate the first embedding.
  10. The method of claim 3, wherein the prior context of the corresponding conversation includes at least one or more prior portions of the corresponding conversation, and wherein the one or more prior portions of the corresponding conversation occur, in the corresponding conversation, before at least the portion of the corresponding conversation.
  11. The method of any preceding claim, wherein obtaining the corresponding feature emphasis input associated with one or more of the plurality of training instances comprises: receiving natural language input from one or more humans associated with the third-party, wherein the natural language input is one or more of: free-form spoken input or free-form typed input; and processing the natural language input to obtain the corresponding feature emphasis input associated with one or more of the plurality of training instances.
  12. The method of any preceding claim, wherein one or more of the plurality of training instances are obtained from a corpus of training instances, and wherein the corpus of training instances includes a plurality of previous conversations between multiple humans, wherein one or more of the plurality of training instances are obtained from a corresponding demonstrative conversation between one or more humans, and wherein one or more of the humans are associated with the third-party, and/or wherein one or more of the plurality of training instances are obtained from a spoken utterance received via the voice bot development platform, and wherein the spoken utterance is received from one or more humans associated with the third-party.
  13. The method of any preceding claim, wherein causing the trained voice bot to be deployed for conducting the conversations on behalf of the third-party comprises causing the trained voice bot to be deployed for conducting the conversations for telephone calls associated with the third-party, and wherein causing the trained voice bot to be deployed for conducting the conversations for the telephone calls associated with the third-party comprises: causing the voice bot to answer corresponding incoming telephone calls and to conduct the conversations with corresponding humans that initiated the corresponding incoming telephone calls via respective client devices.
  14. A system comprising: at least one processor; and memory storing instructions that, when executed, cause the at least one processor to be operable to perform the method according to any one of claims 1 to 13.
  15. A non-transitory computer-readable storage medium storing instructions that, when executed, cause at least one processor to be operable to perform the method according to any one of claims 1 to 13.
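The training pipeline recited in claims 2 through 7 can be illustrated with a minimal numeric sketch. Everything below is a hypothetical illustration, not the patented implementation: toy averaging encoders stand in for the claimed first and second ML layers, a single linear map stands in for the additional ML layers that produce the predicted response embedding, and the feature emphasis input is modeled simply as a scalar weight on the embedding-space loss (the claims instead describe attentioning a transformer's attention mechanisms).

```python
import numpy as np

DIM = 8  # toy embedding width (assumption)

def toy_embed(tokens, dim=DIM):
    # Stand-in for the first/second ML layers (claims 2-3): each token is
    # mapped deterministically to a fixed random vector, then averaged.
    vecs = []
    for tok in tokens:
        seed = int.from_bytes(tok.encode("utf-8"), "little") % (2**32)
        vecs.append(np.random.default_rng(seed).standard_normal(dim))
    return np.mean(vecs, axis=0)

def current_state_embedding(speech_hypotheses, prior_context):
    # Claim 3: embed the speech hypotheses and the prior context separately,
    # then concatenate the two embeddings into the current-state embedding.
    first = toy_embed([w for hyp in speech_hypotheses for w in hyp.split()])
    second = toy_embed(prior_context.split())
    return np.concatenate([first, second])

def training_step(W, state_emb, ground_truth_emb, emphasis=1.0, lr=0.05):
    # Claim 5: project the current state to a predicted response embedding.
    predicted = W @ state_emb
    # Claim 6: compare predicted and ground-truth embeddings in embedding
    # space; here the feature emphasis input simply scales the squared-error
    # loss and its gradient (a simplification of claim 7's attentioning).
    error = predicted - ground_truth_emb
    loss = emphasis * float(error @ error)
    grad = 2.0 * emphasis * np.outer(error, state_emb)
    return loss, W - lr * grad  # claim 6: update the model from the loss

# Hypothetical Hypothetical Café training instance.
hyps = ["a table for four tonight", "a table for four at night"]
context = "hello thanks for calling hypothetical cafe"
state = current_state_embedding(hyps, context)                 # shape (16,)
target = toy_embed("sure what time would you like".split())    # shape (8,)

W = np.zeros((DIM, 2 * DIM))
losses = []
for _ in range(20):
    loss, W = training_step(W, state, target, emphasis=2.0)
    losses.append(loss)
```

Because the loss is quadratic in the error, each step shrinks the gap between the predicted and ground-truth embeddings, and a larger `emphasis` value makes mismatches on the emphasized training instance cost proportionally more.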

Description

Background

Humans may engage in human-to-computer dialogs with interactive software applications referred to as "bots," "chatbots," "automated assistants," "interactive personal assistants," "intelligent personal assistants," "conversational agents," etc. via a variety of computing devices. As one example, these bots can initiate telephone calls or answer incoming telephone calls, and conduct conversations with humans to perform action(s) on behalf of a third-party. However, functionality of these bots may be limited by pre-defined intent schemas that the bots utilize to perform the action(s). In other words, if a human that is engaged in a dialog with a bot provides a spoken utterance that includes an intent not defined by the pre-defined intent schemas, then the bot will fail. Further, to update these bots, existing intent schemas may be modified or new intent schemas may be added. However, there are virtually limitless intent schemas that may need to be defined to make the bots robust to the various nuances of human speech. Extensive utilization of computational resources is required to manually define and/or manually refine such intent schemas. Further, even if a large quantity of intent schemas is defined, a large amount of memory is required to store and/or utilize them. Accordingly, intent schemas are not practically scalable to the extent of learning the nuances of human speech.

In the prior art, patent application US2020/0090651A1 discloses techniques for generating dialogue responses based on a machine learning (ML) model. The ML model is trained with training utterances and ground truth responses and uses an attention mechanism. Patent application US2020/0097544A1 discloses using a corpus of conversation data to train a response recommendation model considering context information.
Patent application US2019/0378015A1 discloses techniques for deploying a trained ML model that may be transformed to reduce the resources required to update the ML model. Patent application US2020/0344185A1 discloses a digital assistant builder platform.

Summary

The invention provides a method implemented by one or more processors according to claim 1, a system according to claim 14, and a non-transitory computer-readable storage medium storing instructions according to claim 15. Preferable aspects are defined by the dependent claims.

Implementations disclosed herein are directed to providing a voice bot development platform that enables a voice bot associated with a third-party to be trained based on a plurality of training instances. The voice bot can correspond to one or more processors that utilize a plurality of machine learning (ML) layers, of one or more ML models, for conducting conversations, on behalf of the third-party, for telephone calls associated with the third-party. The voice bot development platform can obtain the plurality of training instances based on user input, from a third-party developer and via a client device associated with the third-party developer, directed to the voice bot development platform. The telephone calls associated with the third-party can include incoming telephone calls initiated by a human via a respective client device and directed to the third-party, and/or outgoing telephone calls initiated by the voice bot via the voice bot development platform and directed to the human or an additional third-party associated with the human. Further, the telephone calls associated with the third-party can be performed using various voice communication protocols (e.g., Voice over Internet Protocol (VoIP), public switched telephone networks (PSTN), and/or other telephonic communication protocols).
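The training instances and feature emphasis input described above can be represented with a simple container. This is only an illustrative sketch: the field names, types, and the example emphasis weighting are assumptions, not structures taken from the patent.

```python
from dataclasses import dataclass, field

@dataclass
class TrainingInstance:
    # Training instance input (claim 1): at least a portion of a
    # corresponding conversation, plus its prior context.
    conversation_portion: str
    prior_context: list
    # Training instance output (claim 1): the ground truth response.
    ground_truth_response: str
    # Optional developer-supplied feature emphasis input, here modeled as
    # feature-name -> weight (hypothetical representation).
    feature_emphasis: dict = field(default_factory=dict)

# A hypothetical instance a third-party developer might supply via the
# voice bot development platform for the Hypothetical Café example.
instance = TrainingInstance(
    conversation_portion="I'd like a table for four tonight",
    prior_context=["Hello, thanks for calling Hypothetical Cafe."],
    ground_truth_response="Sure, what time would you like the reservation?",
    feature_emphasis={"party size": 2.0},
)
```

A corpus of such instances, whether drawn from previous conversations, demonstrative conversations, or spoken utterances received via the platform (claim 12), could then be iterated over during training.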
For example, assume the third-party for which the voice bot is being trained is a fictitious restaurant entity named Hypothetical Café. Further assume a plurality of training instances for training the voice bot associated with Hypothetical Café are obtained via the voice bot development platform. In this example, the voice bot may subsequently answer incoming telephone calls and, during the telephone conversations, perform one or more actions related to restaurant reservations, hours of operation inquiries, carryout orders, and/or any other actions associated with incoming telephone calls directed to Hypothetical Café. Further, the voice bot may additionally or alternatively initiate outgoing telephone calls and, during those telephone conversations, perform one or more actions related to inventory orders, information technology requests, and/or any other actions on behalf of Hypothetical Café. Notably, multiple respective instances of the voice bot may be deployed such that the respective instances of the voice bot can engage in multiple respective conversations with respective humans at any given time. For example, each instance of the v