US-12626693-B2 - Modeling attention to improve classification and provide inherent explainability
Abstract
In an artificial intelligence model (AI model), input data is processed to provide both a classification of the input data and a visualization of the process of the AI model. This is done by performing intent and slot classification of the input data, generating weights and binary classifier logits, and performing feature fusion and classification. A graphical explanation is then output as a visualization along with logits.
Inventors
- Dalkandura Arachchige K.S.S. GUNARATNA
- Vijay Srinivasan
- Hongxia Jin
Assignees
- SAMSUNG ELECTRONICS CO., LTD.
Dates
- Publication Date
- 20260512
- Application Date
- 20221208
Claims (17)
- 1 . A method of visualizing a natural language understanding model, the method comprising: parsing an utterance into a vector of tokens; encoding the utterance with an encoder to obtain a vector of token embeddings; applying an intent classifier, based on the vector of token embeddings, to obtain an estimated intent; obtaining a vector of slot type weights for visualization, wherein the obtaining the vector of slot type weights uses an auxiliary network and is based on the vector of token embeddings and based on the intent logits, wherein obtaining the vector of slot type weights comprises, for each slot type, applying a binary classifier to the vector of multiple self-attentions to obtain the vector of slot type weights; obtaining a vector of multiple self-attentions, wherein the obtaining the vector of multiple self-attentions uses the auxiliary network and is based on the vector of token embeddings and based on the intent logits, wherein the estimated intent includes a vector of intent logits, and wherein the obtaining the vector of multiple self-attentions comprises: concatenating the vector of intent logits with the vector of token embeddings to obtain an expanded intent logits vector, and obtaining the vector of multiple self-attentions based on the expanded intent logits vector; visualizing the vector of slot type weights in a two column format, wherein the two column format comprises a first column and a second column; performing a feature fusion based on the vector of slot type weights and based on the vector of token embeddings to obtain a vector of fused features; and obtaining, based on the vector of fused features and using a slot classifier, a vector of classified slots corresponding to the utterance.
- 2 . The method of claim 1 , further comprising outputting the vector of classified slots for fulfillment by a voice-activated artificial intelligence-based personal assistant.
- 3 . The method of claim 1 , wherein the visualizing the vector of slot type weights comprises providing a visual presentation including the first column and the second column with bars connecting column entries from the first column to the second column, the first column and the second column both listing the vector of tokens, wherein a first bar corresponds to a correspondence between a first token in the first column with a second token in the second column.
- 4 . The method of claim 1 , wherein a training of the vector of slot type weights is based on an output of the binary classifier.
- 5 . The method of claim 4 , wherein the performing the feature fusion comprises: computing a cross-attention vector based on the vector of slot type weights and based on the vector of token embeddings; forming an intermediate vector as a sum of the cross-attention vector and the vector of token embeddings; and forming the vector of fused features based on applying the intermediate vector to a linear layer and normalizing an output of the linear layer.
- 6 . The method of claim 1 , further comprising: determining a vector of slot type specific attentions; wherein the applying the slot classifier comprises operating on the vector of slot type specific attentions; wherein the performing the feature fusion is further based on the vector of slot type specific attentions; and wherein the method further comprises visualizing the vector of slot type specific attentions.
- 7 . A server for utterance recognition and model visualization, the server comprising: one or more processors; and one or more memories, the one or more memories storing a program, wherein execution of the program by the one or more processors is configured to cause the server to at least: parse an utterance into a vector of tokens; encode the utterance with an encoder to obtain a vector of token embeddings; apply an intent classifier, based on the vector of token embeddings, to obtain an estimated intent; obtain a vector of slot type weights for visualization, wherein the obtaining the vector of slot type weights uses an auxiliary network and is based on the vector of token embeddings and based on the intent logits, wherein execution of the program by the one or more processors is further configured to obtain the vector of slot type weights by, for each slot type, applying a binary classifier to the vector of multiple self-attentions to obtain the vector of slot type weights; obtain a vector of multiple self-attentions, wherein the obtaining the vector of multiple self-attentions uses the auxiliary network and is based on the vector of token embeddings and based on the intent logits, wherein the estimated intent includes a vector of intent logits, and wherein execution of the program by the one or more processors is further configured to obtain the vector of multiple self-attentions by: concatenating the vector of intent logits with the vector of token embeddings to obtain an expanded intent logits vector, and obtaining the vector of multiple self-attentions based on the expanded intent logits vector; visualize the vector of slot type weights in a two column format, wherein the two column format comprises a first column and a second column; perform a feature fusion based on the vector of slot type weights and based on the vector of token embeddings to obtain a vector of fused features; and obtain, based on the vector of fused features and using a slot classifier, a vector of classified slots corresponding to the utterance.
- 8 . The server of claim 7 , wherein execution of the program by the one or more processors is further configured to cause the server to output the vector of classified slots for fulfillment by a voice-activated artificial intelligence-based personal assistant.
- 9 . The server of claim 7 , wherein execution of the program by the one or more processors is further configured to provide information for a debugging engineer to alter the intent classifier and/or the slot classifier and/or model training data based on the vector of slot type weights visualized in the two column format.
- 10 . The server of claim 7 , wherein execution of the program by the one or more processors is further configured to visualize the vector of slot type weights by providing a visual presentation including the first column and the second column with bars connecting column entries from the first column to the second column, the first column and the second column both listing the vector of tokens, wherein a first bar corresponds to a correspondence between a first token in the first column with a second token in the second column, thereby permitting a person to recognize focus points on the utterance relevant to the classification by a natural language understanding model.
- 11 . The server of claim 7 , wherein a training of the vector of slot type weights is based on an output of the binary classifier.
- 12 . The server of claim 11 , wherein execution of the program by the one or more processors is further configured to perform the feature fusion by: computing a cross-attention vector based on the vector of slot type weights and based on the vector of token embeddings; forming an intermediate vector as a sum of the cross-attention vector and the vector of token embeddings; and forming the vector of fused features based on applying the intermediate vector to a linear layer and normalizing an output of the linear layer.
- 13 . The server of claim 7 , wherein execution of the program by the one or more processors is further configured to: determine a vector of special slot type specific attentions; wherein execution of the program by the one or more processors is further configured to apply the slot classifier by operating on the vector of special slot type specific attentions; wherein execution of the program by the one or more processors is further configured to perform the feature fusion based on the vector of special slot type specific attentions; and wherein execution of the program by the one or more processors is further configured to visualize the vector of special slot type specific attentions.
- 14 . A non-transitory computer readable medium configured to store a program for utterance recognition and model visualization, wherein execution of the program by one or more processors of a server is configured to cause the server to at least: parse an utterance into a vector of tokens; encode the utterance with an encoder to obtain a vector of token embeddings; apply an intent classifier, based on the vector of token embeddings, to obtain an estimated intent; obtain a vector of slot type weights for visualization, wherein the obtaining the vector of slot type weights uses an auxiliary network and is based on the vector of token embeddings and based on the intent logits, wherein execution of the program by the one or more processors is further configured to obtain the vector of slot type weights by, for each slot type, applying a binary classifier to the vector of multiple self-attentions to obtain the vector of slot type weights; obtain a vector of multiple self-attentions, wherein the obtaining the vector of multiple self-attentions uses the auxiliary network and is based on the vector of token embeddings and based on the intent logits, wherein the estimated intent includes a vector of intent logits, and wherein execution of the program by the one or more processors is further configured to obtain the vector of multiple self-attentions by: concatenating the vector of intent logits with the vector of token embeddings to obtain an expanded intent logits vector, and obtaining the vector of multiple self-attentions based on the expanded intent logits vector; visualize the vector of slot type weights in a two column format, wherein the two column format comprises a first column and a second column; perform a feature fusion based on the vector of slot type weights and based on the vector of token embeddings to obtain a vector of fused features; and obtain, based on the vector of fused features and using a slot classifier, a vector of classified slots corresponding to the utterance.
- 15 . The non-transitory computer readable medium of claim 14 , wherein execution of the program by the one or more processors of the server is configured to cause the server to output the vector of classified slots for fulfillment by a voice-activated artificial intelligence-based personal assistant.
- 16 . The non-transitory computer readable medium of claim 14 , wherein execution of the program by the one or more processors of the server is configured to provide information for a debugging engineer to alter the intent classifier and/or the slot classifier based on the vector of slot type weights visualized in the two column format.
- 17 . The non-transitory computer readable medium of claim 14 , wherein execution of the program by the one or more processors of the server is configured to visualize the vector of slot type weights by providing a visual presentation including the first column and the second column with bars connecting column entries from the first column to the second column, the first column and the second column both listing the vector of tokens, wherein a first bar corresponds to a correspondence between a first token in the first column with a second token in the second column, thereby permitting a person to recognize focus points on the utterance relevant to the classification by a natural language understanding model.
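The auxiliary-network computation recited in independent claims 1, 7, and 14 — concatenating the intent logits onto the token embeddings, computing one self-attention per slot type, and applying a per-slot-type binary classifier to obtain the slot type weights — can be sketched numerically. The sketch below is illustrative only, not the patented implementation; the array sizes, the single-head attention form, and the random projection matrices are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 5, 8          # tokens in the utterance, embedding width (assumed sizes)
N_INTENTS = 3        # length of the intent logits vector
N_SLOT_TYPES = 4     # one self-attention and one binary classifier per slot type

token_embeddings = rng.normal(size=(T, D))      # output of the encoder
intent_logits = rng.normal(size=(N_INTENTS,))   # output of the intent classifier

# Concatenate the intent logits onto every token embedding to form the
# "expanded intent logits vector" of the claims.
expanded = np.concatenate(
    [token_embeddings, np.tile(intent_logits, (T, 1))], axis=1)  # (T, D + N_INTENTS)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over tokens
    return weights @ v

d_in, d_attn = D + N_INTENTS, D
# One self-attention per slot type -> the "vector of multiple self-attentions".
attentions = np.stack([
    self_attention(expanded,
                   rng.normal(size=(d_in, d_attn)),
                   rng.normal(size=(d_in, d_attn)),
                   rng.normal(size=(d_in, d_attn)))
    for _ in range(N_SLOT_TYPES)])                  # (N_SLOT_TYPES, T, d_attn)

# Per slot type, a binary classifier scores each token (does this slot type
# appear at this token?); the sigmoid outputs serve as the slot type weights.
w_bin = rng.normal(size=(N_SLOT_TYPES, d_attn))
slot_type_weights = 1.0 / (1.0 + np.exp(-np.einsum('std,sd->st', attentions, w_bin)))

print(slot_type_weights.shape)   # one weight per (slot type, token) pair
```

The resulting `slot_type_weights` matrix is what the claims both visualize and feed into the feature fusion.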
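Claims 5 and 12 recite the feature fusion as a cross-attention computed from the slot type weights and the token embeddings, a residual sum with the token embeddings, and a linear layer whose output is normalized. A minimal sketch under stated assumptions follows; how the queries are derived from the slot type weights, and all layer sizes, are assumptions, since the claims do not fix them.

```python
import numpy as np

rng = np.random.default_rng(1)

T, D = 5, 8            # tokens, embedding width (assumed)
N_SLOT_TYPES = 4

token_embeddings = rng.normal(size=(T, D))
slot_type_weights = rng.uniform(size=(N_SLOT_TYPES, T))   # from the binary classifiers

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Cross-attention: each token's query is its weight profile across slot types
# (an assumed formulation); keys and values come from the token embeddings.
w_q = rng.normal(size=(N_SLOT_TYPES, D))
w_k = rng.normal(size=(D, D))
w_v = rng.normal(size=(D, D))
q = slot_type_weights.T @ w_q               # (T, D)
k = token_embeddings @ w_k                  # (T, D)
v = token_embeddings @ w_v                  # (T, D)
cross_attention = softmax(q @ k.T / np.sqrt(D)) @ v    # (T, D)

# Residual sum, then a linear layer followed by per-token normalization.
intermediate = cross_attention + token_embeddings
w_lin, b_lin = rng.normal(size=(D, D)), np.zeros(D)
h = intermediate @ w_lin + b_lin
fused = (h - h.mean(axis=1, keepdims=True)) / (h.std(axis=1, keepdims=True) + 1e-5)

print(fused.shape)   # vector of fused features, one row per token
```

The fused features are then the input to the slot classifier that produces the vector of classified slots.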
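Claims 3, 10, and 17 describe the two column visualization: both columns list the utterance's tokens, and a bar connecting a token in the first column to a token in the second column indicates a correspondence between them. A terminal-style sketch of that presentation, with an invented utterance and made-up attention values:

```python
# Hypothetical token-to-token correspondences for the utterance
# "book a flight to boston"; the weights are invented for illustration.
tokens = ["book", "a", "flight", "to", "boston"]
attention = {
    (0, 2): 0.91,   # "book" corresponds to "flight"
    (2, 4): 0.84,   # "flight" corresponds to "boston"
    (3, 4): 0.67,   # "to" corresponds to "boston"
}

THRESHOLD = 0.5
width = max(len(t) for t in tokens)

# Left and right columns both list tokens; a bar is drawn between entries
# whose weight clears the threshold, strongest correspondences first.
lines = []
for (i, j), w in sorted(attention.items(), key=lambda kv: -kv[1]):
    if w >= THRESHOLD:
        bar = "=" * int(round(w * 10))
        lines.append(f"{tokens[i]:<{width}} {bar:<10} {tokens[j]:>{width}}  ({w:.2f})")
print("\n".join(lines))
```

A reader of such a display can see which tokens the model focuses on for each classification decision, which is the explainability benefit the claims describe.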
Description
CROSS REFERENCE TO RELATED APPLICATION
This application claims the benefit of priority of U.S. Provisional Application No. 63/307,592, filed Feb. 7, 2022, the contents of which are hereby incorporated by reference.
FIELD
The present disclosure is related to artificial intelligence performing classification of input data.
BACKGROUND
The present application relates to classification of input data. In an example, the present application discusses joint intent detection and slot filling for natural language understanding (NLU). Existing systems learn features collectively over all slot types (i.e., labels) and have no way to explain the model. A lack of explainability creates doubt in a user as to what a model is doing. A lack of explainability also makes improving the model difficult when errors occur. Adding explainability by an additional process unrelated to intent detection and slot filling reduces the efficiency and correctness of explanations.
SUMMARY
Embodiments provided herein provide classification (inference mapping input data to one particular class from a set of classes, or mapping input data to soft values, one soft value for each class of the set) and explainability (visual outputs that explain how an AI model arrived at a classification). In an artificial intelligence model (AI model) of embodiments provided herein, an utterance is processed to provide both a classification of the utterance and a visualization of the process of the AI model. This is done by performing intent classification of the utterance, generating slot type weights and binary classifier logits, and performing feature fusion and slot classification. A graphical slot explanation is then output as a visualization along with slot logits. Based on the output, a voice-activated AI-based personal assistant can take action on the input utterance. The visualization also assists a debugging engineer in the task of improving the AI model.
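The joint intent detection and slot filling task discussed in the background can be made concrete with a small example; the utterance, intent name, and slot names below are invented for illustration, and the BIO tagging convention is a common choice rather than anything the disclosure mandates.

```python
# Hypothetical NLU output for one utterance, using BIO-style slot tags.
utterance = "play jazz by miles davis"
tokens = utterance.split()

intent = "PlayMusic"                                      # one class per utterance
slots = ["O", "B-genre", "O", "B-artist", "I-artist"]     # one tag per token

# Collect the filled slots from the BIO tags: B- starts a slot value,
# I- continues it, O marks tokens outside any slot.
filled = {}
for token, tag in zip(tokens, slots):
    if tag.startswith("B-"):
        filled[tag[2:]] = [token]
    elif tag.startswith("I-"):
        filled[tag[2:]].append(token)
filled = {slot: " ".join(words) for slot, words in filled.items()}

print(intent, filled)   # PlayMusic {'genre': 'jazz', 'artist': 'miles davis'}
```

An assistant can fulfill the request from this structured output, while the visualization described herein explains which tokens drove each slot decision.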
Provided herein is a method of visualizing a natural language understanding model, the method including: parsing an utterance into a vector of tokens; encoding the utterance with an encoder to obtain a vector of token embeddings; applying an intent classifier, based on the vector of token embeddings, to obtain an estimated intent; obtaining a vector of slot type weights for visualization, where the obtaining uses an auxiliary network and is based on the vector of token embeddings and on the estimated intent; obtaining a vector of multiple self-attentions, where the obtaining uses the auxiliary network and is based on the vector of token embeddings and on the estimated intent; visualizing the vector of slot type weights in a two column format comprising a first column and a second column; performing a feature fusion based on the vector of slot type weights and on the vector of token embeddings to obtain a vector of fused features; and obtaining, based on the vector of fused features and using a slot classifier, a vector of classified slots corresponding to the utterance.
Also provided herein is a server for utterance recognition and model visualization, the server including: one or more processors; and one or more memories, the one or more memories storing a program, wherein execution of the program by the one or more processors is configured to cause the server to at least: parse an utterance into a vector of tokens; encode the utterance with an encoder to obtain a vector of token embeddings; apply an intent classifier, based on the vector of token embeddings, to obtain an estimated intent; obtain a vector of slot type weights for visualization, wherein the obtaining the vector of slot type weights uses an auxiliary network and is based on the vector of token embeddings and based on the estimated intent; obtain a vector of multiple self-attentions, wherein the obtaining the vector of multiple self-attentions uses the auxiliary network and is based on the vector of token embeddings and based on the estimated intent; visualize the vector of slot type weights in a two column format, wherein the two column format comprises a first column and a second column; perform a feature fusion based on the vector of slot type weights and based on the vector of token embeddings to obtain a vector of fused features; and obtain, based on the vector of fused features and using a slot classifier, a vector of classified slots corresponding to the utterance. Also provided herein is a non-transitory computer readable medium configured to store a program for utterance recognition and model visualization, wherein execution of the program by one or more processors of a server is configured to cause the server to at least: parse an utterance into a vector of tokens; encode the utterance with an encoder to obtain a vector of token embeddings; apply an intent classifier, based on the vector