US-12619834-B2 - Systems and methods for intent classification in a natural language processing agent

US 12619834 B2

Abstract

Embodiments described herein provide a cross-lingual intent classification model that predicts intents in multiple languages without requiring training data in all of those languages. For example, the training data requirement can be reduced to just one utterance per intent label. Specifically, when an utterance is fed to the intent classification model, the model checks whether the utterance is similar to any of the example utterances provided for each intent. If any such utterance(s) are found, the model returns the matched intent; otherwise, it returns out of domain (OOD).
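The similarity-check-with-OOD-fallback flow described in the abstract can be sketched as follows. This is a minimal illustration only, not the patented implementation: cosine distance, the threshold value, and all names and data are assumptions for demonstration.

```python
import numpy as np

def classify_intent(utterance_emb, example_embs, example_labels, threshold):
    """Return the intent of the nearest example utterance, or 'OOD' if
    no example is similar enough (cosine distance above threshold)."""
    # Cosine similarities between the utterance and every example embedding.
    sims = example_embs @ utterance_emb / (
        np.linalg.norm(example_embs, axis=1) * np.linalg.norm(utterance_emb)
    )
    dists = 1.0 - sims
    best = int(np.argmin(dists))
    if dists[best] > threshold:
        return "OOD"  # no example utterance is close enough: out of domain
    return example_labels[best]
```

With one example embedding per intent label, this already reflects the one-utterance-per-intent data requirement mentioned above.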

Inventors

  • Shubham Mehrotra
  • Zachary Alexander
  • Shilpa Bhagavath
  • Gurkirat Singh
  • Shashank Harinath
  • Anuprit Kale

Assignees

  • Salesforce, Inc.

Dates

Publication Date
2026-05-05
Application Date
2023-01-30

Claims (17)

  1. A method of operating a cross-lingual intent classification framework in a conversation agent, the method comprising: receiving, via a communication interface for the cross-lingual intent classification framework implemented on one or more processors, a training dataset of utterances annotated with intent labels corresponding to a plurality of pre-defined intent classes, respectively, wherein the cross-lingual intent classification framework comprises a dual encoder architecture of a pretrained multi-lingual language model and a weighted K-nearest neighbor classifier as a decoder; encoding, via a shared embedding neural network in the dual encoder architecture of the pretrained multi-lingual language model, a first set of utterances from the training dataset into a first set of embeddings in a feature space; receiving, via a communication interface, a testing utterance; encoding separately, via the shared embedding neural network in the dual encoder architecture of the pretrained multi-lingual language model, the testing utterance into a first embedding in the feature space; computing a first set of distances between the first embedding and the first set of embeddings, respectively; determining a set of nearest neighboring embeddings from the first set of embeddings, which are nearest neighbors to the first embedding based on the first set of distances; removing, from the set of nearest neighboring embeddings, one or more neighboring embeddings that are out of range from the first embedding based on at least one inter-domain threshold associated with the plurality of pre-defined intent classes; decoding, via the weighted K-nearest neighbor classifier, a first intent label for the testing utterance based on remaining neighboring embeddings in the set of nearest neighboring embeddings; and generating, by a question answering model, a response to the testing utterance at least in part based on the first intent label.
  2. The method of claim 1, wherein the pretrained language model is a multilingual language model, and wherein the training dataset comprises utterances and/or intent labels in multiple languages.
  3. The method of claim 1, wherein each distance in the first set of distances is computed as a semantic similarity between the testing utterance and a training utterance corresponding to an embedding in the first set of embeddings.
  4. The method of claim 1, further comprising: encoding, via the pretrained language model, a training utterance into a second embedding in the feature space; computing a second set of distances between the second embedding and the first set of embeddings, respectively; determining, via the weighted K-nearest neighbor classifier, a second intent label for the training utterance; computing a loss based on a difference between the second intent label and an annotated label associated with the training utterance; and updating the weighted K-nearest neighbor classifier based on the loss while keeping the pretrained language model frozen.
  5. The method of claim 1, further comprising: computing, for each training utterance that belongs to a first intent class, a respective set of distances between each training utterance and other training utterances in all other intent classes, respectively; determining the at least one inter-domain threshold according to a percentage of all computed distances; determining whether the set of nearest neighboring embeddings are within a range of the at least one inter-domain threshold to the first embedding; and in response to determining that no neighboring embedding is under the inter-domain threshold to the first embedding, determining that the testing utterance is out of domain (OOD).
  6. The method of claim 1, wherein a value of K is determined by a minimum number of utterances that belong to a same intent class in the first set of utterances.
  7. A system of operating a cross-lingual intent classification framework in a conversation agent, the system comprising: a communication interface receiving a training dataset of utterances annotated with intent labels corresponding to a plurality of pre-defined intent classes, respectively, and a testing utterance; a memory storing the cross-lingual intent classification framework comprising a dual encoder architecture of a pretrained multi-lingual language model, a weighted K-nearest neighbor classifier as a decoder, and a plurality of processor-executable instructions; and one or more processors executing the plurality of processor-executable instructions to perform operations comprising: encoding, via a shared embedding neural network in a dual encoder architecture of the pretrained multi-lingual language model, a first set of utterances from the training dataset into a first set of embeddings in a feature space; receiving, via a communication interface, a testing utterance; encoding separately, via the shared embedding neural network in the dual encoder architecture of the pretrained multi-lingual language model, the testing utterance into a first embedding in the feature space; computing a first set of distances between the first embedding and the first set of embeddings, respectively; determining a set of nearest neighboring embeddings from the first set of embeddings, which are nearest neighbors to the first embedding based on the first set of distances; removing, from the set of nearest neighboring embeddings, one or more neighboring embeddings that are out of range from the first embedding based on at least one inter-domain threshold associated with the plurality of pre-defined intent classes; decoding, via the weighted K-nearest neighbor classifier, a first intent label for the testing utterance based on remaining neighboring embeddings in the set of nearest neighboring embeddings; and generating, by a question answering model, a response to the testing utterance at least in part based on the first intent label.
  8. The system of claim 7, wherein the pretrained language model is a multilingual language model, and wherein the training dataset comprises utterances and/or intent labels in multiple languages.
  9. The system of claim 7, wherein each distance in the first set of distances is computed as a semantic similarity between the testing utterance and a training utterance corresponding to an embedding in the first set of embeddings.
  10. The system of claim 7, wherein the operations further comprise: encoding, via the pretrained language model, a training utterance into a second embedding in the feature space; computing a second set of distances between the second embedding and the first set of embeddings, respectively; determining, via the weighted K-nearest neighbor classifier, a second intent label for the training utterance; computing a loss based on a difference between the second intent label and an annotated label associated with the training utterance; and updating the weighted K-nearest neighbor classifier based on the loss while keeping the pretrained language model frozen.
  11. The system of claim 7, wherein the operations further comprise: computing, for each training utterance that belongs to a first intent class, a respective set of distances between each training utterance and other training utterances in all other intent classes, respectively; determining the at least one inter-domain threshold according to a percentage of all computed distances; determining whether the set of nearest neighboring embeddings are within a range of the at least one inter-domain threshold to the first embedding; and in response to determining that no neighboring embedding is under the inter-domain threshold to the first embedding, determining that the testing utterance is out of domain (OOD).
  12. The system of claim 7, wherein a value of K is determined by a minimum number of utterances that belong to a same intent class in the first set of utterances.
  13. A non-transitory processor-readable storage medium storing processor-executable instructions for operating a cross-lingual intent classification framework in a conversation agent, the instructions being executed by one or more processors to perform operations comprising: receiving, via a communication interface for the cross-lingual intent classification framework implemented on one or more processors, a training dataset of utterances annotated with intent labels corresponding to a plurality of pre-defined intent classes, respectively, wherein the cross-lingual intent classification framework comprises a dual encoder architecture of a pretrained multi-lingual language model and a weighted K-nearest neighbor classifier as a decoder; encoding, via a shared embedding neural network in a dual encoder architecture of the pretrained multi-lingual language model, a first set of utterances from the training dataset into a first set of embeddings in a feature space; receiving, via a communication interface, a testing utterance; encoding separately, via the shared embedding neural network in the dual encoder architecture of the pretrained multi-lingual language model, the testing utterance into a first embedding in the feature space; computing a first set of distances between the first embedding and the first set of embeddings, respectively; determining a set of nearest neighboring embeddings from the first set of embeddings, which are nearest neighbors to the first embedding based on the first set of distances; removing, from the set of nearest neighboring embeddings, one or more neighboring embeddings that are out of range from the first embedding based on at least one inter-domain threshold associated with the plurality of pre-defined intent classes; decoding, via the weighted K-nearest neighbor classifier, a first intent label for the testing utterance based on remaining neighboring embeddings in the set of nearest neighboring embeddings; and generating, by a question answering model, a response to the testing utterance at least in part based on the first intent label.
  14. The medium of claim 13, wherein the pretrained language model is a multilingual language model, and wherein the training dataset comprises utterances and/or intent labels in multiple languages.
  15. The medium of claim 13, wherein each distance in the first set of distances is computed as a semantic similarity between the testing utterance and a training utterance corresponding to an embedding in the first set of embeddings.
  16. The medium of claim 13, wherein the operations further comprise: encoding, via the pretrained language model, a training utterance into a second embedding in the feature space; computing a second set of distances between the second embedding and the first set of embeddings, respectively; determining, via the weighted K-nearest neighbor classifier, a second intent label for the training utterance; computing a loss based on a difference between the second intent label and an annotated label associated with the training utterance; and updating the weighted K-nearest neighbor classifier based on the loss while keeping the pretrained language model frozen.
  17. The medium of claim 13, wherein the operations further comprise: computing, for each training utterance that belongs to a first intent class, a respective set of distances between each training utterance and other training utterances in all other intent classes, respectively; determining the at least one inter-domain threshold according to a percentage of all computed distances; determining whether the set of nearest neighboring embeddings are within a range of the at least one inter-domain threshold to the first embedding; and in response to determining that no neighboring embedding is under the inter-domain threshold to the first embedding, determining that the testing utterance is out of domain (OOD).
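The nearest-neighbor decoding recited in claim 1, together with the threshold construction of claim 5, can be sketched as follows. This is an illustrative approximation only: the claims fix neither a distance metric nor a percentage value, so Euclidean distance and a 5th-percentile cutoff are assumptions, and all helper names and data are hypothetical.

```python
import numpy as np

def interdomain_threshold(embs, labels, percentile=5.0):
    """Per claim 5: collect distances between each training utterance and
    training utterances in *other* intent classes, then take a percentage
    (here a percentile) of all computed distances as the threshold."""
    dists = [np.linalg.norm(embs[i] - embs[j])
             for i in range(len(embs))
             for j in range(len(embs))
             if labels[i] != labels[j]]
    return np.percentile(dists, percentile)

def knn_decode(query_emb, embs, labels, k, threshold):
    """Per claim 1: find the k nearest training embeddings, remove neighbors
    beyond the inter-domain threshold, then take a distance-weighted vote;
    return 'OOD' if no neighbor survives (claim 5)."""
    d = np.linalg.norm(embs - query_emb, axis=1)
    nearest = np.argsort(d)[:k]
    kept = [i for i in nearest if d[i] <= threshold]
    if not kept:
        return "OOD"  # every nearest neighbor was out of range
    votes = {}
    for i in kept:
        # Weight each neighbor's vote by inverse distance.
        votes[labels[i]] = votes.get(labels[i], 0.0) + 1.0 / (d[i] + 1e-9)
    return max(votes, key=votes.get)
```

Consistent with claim 6, k would be chosen no larger than the smallest number of training utterances sharing a single intent class.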
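Claim 4 recites computing a loss on a training utterance and updating the weighted K-nearest neighbor classifier while the pretrained language model stays frozen. The claim does not specify how the classifier is parameterized; one plausible realization, sketched below purely as an assumption, gives each intent class a learnable weight over inverse-distance evidence and takes a gradient step on a softmax cross-entropy loss.

```python
import numpy as np

def train_step(class_weights, query_emb, embs, labels, true_class, lr=0.1):
    """One classifier update in the spirit of claim 4: the frozen encoder
    produced all embeddings, so only the per-class kNN weights change.
    The parameterization here is an illustrative assumption."""
    n_classes = len(class_weights)
    d = np.linalg.norm(embs - query_emb, axis=1) + 1e-9
    # Unweighted class evidence: summed inverse distance to each class's examples.
    evidence = np.array([
        sum(1.0 / d[i] for i in range(len(embs)) if labels[i] == c)
        for c in range(n_classes)
    ])
    scores = class_weights * evidence
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    # Softmax cross-entropy gradient with respect to each class weight.
    grad = (probs - np.eye(n_classes)[true_class]) * evidence
    return class_weights - lr * grad
```

Because the gradient flows only into `class_weights`, the pretrained language model is kept frozen exactly as the claim requires.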

Description

CROSS REFERENCES

The instant application is a nonprovisional of and claims priority under 35 U.S.C. § 119 to U.S. provisional application No. 63/381,271, filed Oct. 27, 2022, which is hereby expressly incorporated by reference herein in its entirety.

TECHNICAL FIELD

The embodiments relate generally to natural language processing and machine learning systems, and more specifically to systems and methods for intent classification to provide natural language understanding in a conversation agent.

BACKGROUND

Machine learning systems have been widely used in automatic conversational systems such as an intelligent chatbot in customer service, online learning, and/or the like. For example, a chatbot may assist a user to navigate through different task scenarios, such as travel arrangements, event planning, banking services, and/or the like. A machine learning model is usually trained on a large conversation corpus to generate a system response when a user utterance is received. Traditionally, different machine learning models may be trained per language, and even per domain (such as booking, healthcare, information technology support, banking customer service, and/or the like) to provide specific services to users. Therefore, there is a need for an integrated framework that provides service in multiple languages across different domains with an efficient training scheme.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a simplified block diagram illustrating an example aspect of intent classification with an intelligent chatbot conducting a conversation with a human user, according to embodiments described herein.

FIG. 2A shows an example structure of the intent classification framework, according to embodiments described herein.

FIG. 2B shows an example structure of the intent classification framework with out-of-domain (OOD) identification, according to embodiments described herein.

FIG. 3 is a simplified diagram illustrating a computing device implementing the intent classification framework described in FIGS. 2A-2B, according to one embodiment described herein.

FIG. 4 is a simplified block diagram of a networked system suitable for implementing the intent classification framework described in FIGS. 2A-2B and other embodiments described herein.

FIG. 5 is an example logic flow diagram illustrating a method of intent classification training based on the framework shown in FIGS. 1-2B, according to some embodiments described herein.

FIG. 6 is an example logic flow diagram illustrating a method of intent classification training with OOD detection based on the framework shown in FIGS. 1-2B, according to some embodiments described herein.

FIGS. 7A-8B provide example data plots illustrating data experiment performance, according to embodiments described herein.

Embodiments of the disclosure and their advantages are best understood by referring to the detailed description that follows. It should be appreciated that like reference numerals are used to identify like elements illustrated in one or more of the figures, wherein showings therein are for purposes of illustrating embodiments of the disclosure and not for purposes of limiting the same.

DETAILED DESCRIPTION

As used herein, the term “network” may comprise any hardware or software-based framework that includes any artificial intelligence network or system, neural network or system, and/or any training or learning models implemented thereon or therewith. As used herein, the term “module” may comprise a hardware or software-based framework that performs one or more functions. In some embodiments, the module may be implemented on one or more neural networks.

Machine learning systems have been widely used in automatic conversational systems such as an intelligent chatbot in customer service, online learning, and/or the like.
Intent classification generates an intent label for a user utterance so as to provide natural language understanding (NLU), enabling the intelligent chatbot to redirect the conversation flow as desired based on the identified user intent. For example, FIG. 1 is a simplified block diagram illustrating an example aspect of intent classification with an intelligent chatbot conducting a conversation with a human user, according to embodiments described herein. As shown in FIG. 1, an intelligent service chatbot 105 may be implemented on a user device 110 (e.g., as further described with computing device 300 in FIG. 3, or user device 410 in FIG. 4) to conduct a conversation with a user. When a user says “I want to book an appointment on next Monday morning at 11 am” (e.g., 106), an intent classification model of the intelligent chatbot may classify the user intent of the user utterance 106 as “Request_Appointment” at 108. In this way, the intelligent chatbot 105 may generate a response 109 of making an appointment in response to the user utterance 106 based on the user intent “Request_Appointment.” Traditionally, an intent classification model is trained on a large c