CN-121986350-A - System and method for identifying variants of phrases in text paragraphs

CN 121986350 A

Abstract

A system and method of identifying, by at least one processor, occurrence of a semantic variation of a phrase in a paragraph may include computing a phrase embedding vector representing a semantic meaning of the phrase, extracting at least one hierarchical set of nested sequences of words from a textual representation of the paragraph, computing, for each sequence, a corresponding sequence embedding vector representing the semantic meaning of the sequence, computing, for one or more sequence embedding vectors, a corresponding vector similarity value representing similarity of the sequence embedding vector to the phrase embedding vector, identifying a sequence corresponding to a maximum vector similarity value of the one or more vector similarity values, and determining the identified sequence as the semantic variation of the phrase based on the maximum vector similarity value.
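The claimed pipeline can be sketched end to end in a few lines. The following is a minimal, illustrative sketch and not the patented implementation: the character-trigram `embed` function is a toy stand-in for the ML-based embedding model the abstract assumes, and `nested_sequences` grows each hierarchical set rightward from a single kernel word, which is only one of several possible growth policies.

```python
import math

def embed(text):
    """Toy embedding: a bag of character trigrams. A real system would use
    a trained sentence encoder; this stand-in only makes the sketch runnable."""
    vec = {}
    t = f" {text.lower()} "
    for i in range(len(t) - 2):
        tri = t[i:i + 3]
        vec[tri] = vec.get(tri, 0) + 1
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse (dict) vectors."""
    dot = sum(c * v.get(k, 0) for k, c in u.items())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def nested_sequences(words, kernel, max_len=6):
    """One hierarchical set of nested sequences grown from a kernel word:
    each sequence forms a subset of the words of the next sequence."""
    stop = min(len(words), kernel + max_len)
    return [" ".join(words[kernel:end]) for end in range(kernel + 1, stop + 1)]

def find_variant(phrase, paragraph):
    """Return the sequence whose embedding is most similar to the phrase
    embedding, together with that maximum similarity value."""
    words = paragraph.split()
    pvec = embed(phrase)
    best_seq, best_sim = "", -1.0
    for kernel in range(len(words)):      # here, every word acts as a kernel
        for seq in nested_sequences(words, kernel):
            sim = cosine(embed(seq), pvec)
            if sim > best_sim:
                best_seq, best_sim = seq, sim
    return best_seq, best_sim

seq, sim = find_variant(
    "get me a manager",
    "hi there please transfer me to your manager right away")
print(seq, round(sim, 2))
```

Even with the toy embedding, the maximum-similarity sequence lands on the span around "manager", illustrating how the hierarchical search localizes a variant without any exact lexical match of the full phrase.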

Inventors

  • E. Aubach
  • A. Fazakov
  • L. Haijin
  • Nathalie David
  • R. Moaz

Assignees

  • Genesys Cloud Services, Inc. (吉尼赛斯云服务有限公司)

Dates

Publication Date
2026-05-05
Application Date
2024-08-22
Priority Date
2023-10-10

Claims (20)

  1. A method of identifying, by at least one processor, an occurrence of a semantic variant of a phrase in a paragraph, the method comprising: computing a phrase embedding vector representing a semantic meaning of the phrase, based on a text representation of the phrase; obtaining a text representation of the paragraph, the text representation comprising a plurality of words; extracting at least one hierarchical set of nested sequences of words from the text representation of the paragraph, wherein each sequence in the hierarchical set forms a subset of the words of a subsequent sequence in the hierarchical set; computing, for each sequence, a corresponding sequence embedding vector representing a semantic meaning of the sequence; computing, for one or more sequence embedding vectors, a corresponding vector similarity value representing similarity to the phrase embedding vector; identifying a sequence corresponding to a maximum vector similarity value of the one or more vector similarity values; and determining, based on the maximum vector similarity value, the identified sequence as a semantic variant of the phrase.
  2. The method of claim 1, wherein the at least one hierarchical set comprises a plurality of hierarchical sets.
  3. The method of claim 2, wherein extracting the plurality of hierarchical sets comprises: selecting a plurality of kernel sequences, each kernel sequence including one or more words of the paragraph; and generating, for each kernel sequence, a respective hierarchical set of nested sequences, each nested sequence comprising the kernel sequence and a monotonically increasing number of subsequent words.
  4. The method of claim 3, wherein the text representation of the paragraph is a transcription of a conversation, and wherein selecting a kernel sequence comprises: selecting a portion of the transcription associated with a particular speaker in the conversation; and selecting the kernel sequence to include one or more words of the selected portion.
  5. The method of claim 3, wherein selecting a kernel sequence comprises: computing, for one or more words of the paragraph, one or more corresponding tags representing a part of speech (POS); and selecting the kernel sequence to include one or more words of the paragraph based on the computed POS tags.
  6. The method of claim 3, wherein selecting a kernel sequence comprises: computing, for one or more words of the paragraph, one or more corresponding term relevance metrics; and selecting the kernel sequence to include one or more words of the paragraph based on the computed term relevance metrics.
  7. The method of claim 1, wherein computing a sequence embedding vector of a specific sequence comprises: obtaining a machine learning (ML) based model pre-trained to map between text representations of words and corresponding word embedding vectors; inferring the ML-based model on one or more words of the specific sequence, based on said training, to produce one or more corresponding word embedding vectors; and computing the sequence embedding vector as a function of the one or more word embedding vectors.
  8. The method of claim 7, wherein training the ML-based model comprises: inferring the ML-based model on a first word to produce a first intermediate word embedding vector representing a semantic meaning of the first word; inferring the ML-based model on a second word to produce a second intermediate word embedding vector representing a semantic meaning of the second word; receiving a first annotation data element representing a semantic similarity between the first word and the second word; computing a vector similarity value representing similarity between the intermediate word embedding vectors; computing a first loss function value representing a difference between the vector similarity value and the semantic similarity represented by the first annotation data element; and training the ML-based model to minimize the first loss function value.
  9. The method of claim 7, wherein training the ML-based model comprises: inferring the ML-based model on a first sequence to produce a first intermediate sequence embedding vector; inferring the ML-based model on a second sequence to produce a second intermediate sequence embedding vector; receiving a second annotation data element representing a sequence semantic similarity between the first sequence and the second sequence; computing a second vector similarity value representing similarity between the intermediate sequence embedding vectors; computing a second loss function value representing a difference between the second vector similarity value and the sequence semantic similarity represented by the second annotation data element; and training the ML-based model to minimize the second loss function value.
  10. The method of claim 7, wherein training the ML-based model further comprises: receiving a text representation of the phrase; receiving a paragraph annotation data element indicating that a variant of the phrase occurs in the paragraph; inferring the ML-based model on a hierarchical set of sequences obtained from the paragraph to compute an intermediate maximum vector similarity value; and training the ML-based model such that the intermediate maximum vector similarity value corresponds to the occurrence of a variant of the phrase in the paragraph, as represented by the paragraph annotation data element.
  11. The method of claim 10, wherein the paragraph annotation data element lacks information indicating a position of the variant of the phrase within the received paragraph.
  12. A system for identifying an occurrence of a semantic variant of a phrase in a paragraph, the system comprising a non-transitory memory device having an instruction code module stored therein, and at least one processor associated with the memory device and configured to execute the instruction code module, wherein upon execution of the instruction code module the at least one processor is configured to: compute a phrase embedding vector representing a semantic meaning of the phrase, based on a text representation of the phrase; obtain a text representation of the paragraph, the text representation comprising a plurality of words; extract at least one hierarchical set of nested sequences of words from the text representation of the paragraph, wherein each sequence in the hierarchical set forms a subset of the words of a subsequent sequence in the hierarchical set; compute, for each sequence, a corresponding sequence embedding vector representing a semantic meaning of the sequence; compute, for one or more sequence embedding vectors, a corresponding vector similarity value representing similarity to the phrase embedding vector; identify a sequence corresponding to a maximum vector similarity value of the one or more vector similarity values; and determine, based on the maximum vector similarity value, the identified sequence as a semantic variant of the phrase.
  13. The system of claim 12, wherein the at least one hierarchical set comprises a plurality of hierarchical sets.
  14. The system of claim 13, wherein the at least one processor is further configured to extract the plurality of hierarchical sets by: selecting a plurality of kernel sequences, each kernel sequence including one or more words of the paragraph; and generating, for each kernel sequence, a respective hierarchical set of nested sequences, each nested sequence comprising the kernel sequence and a monotonically increasing number of subsequent words.
  15. The system of claim 14, wherein the text representation of the paragraph is a transcription of a conversation, and wherein the at least one processor is further configured to select a kernel sequence by: selecting a portion of the transcription associated with a particular speaker in the conversation; and selecting the kernel sequence to include one or more words of the selected portion.
  16. The system of claim 14, wherein the at least one processor is further configured to select a kernel sequence by: computing, for one or more words of the paragraph, one or more corresponding tags representing a part of speech (POS); and selecting the kernel sequence to include one or more words of the paragraph based on the computed POS tags.
  17. The system of claim 14, wherein the at least one processor is further configured to select a kernel sequence by: computing, for one or more words of the paragraph, one or more corresponding term relevance metrics; and selecting the kernel sequence to include one or more words of the paragraph based on the computed term relevance metrics.
  18. The system of claim 12, wherein the at least one processor is further configured to compute a sequence embedding vector of a specific sequence by: obtaining a machine learning (ML) based model pre-trained to map between text representations of words and corresponding word embedding vectors; inferring the ML-based model on one or more words of the specific sequence, based on said training, to produce one or more corresponding word embedding vectors; and computing the sequence embedding vector as a function of the one or more word embedding vectors.
  19. The system of claim 18, wherein the at least one processor is further configured to train the ML-based model by: receiving a text representation of the phrase; receiving a paragraph annotation data element indicating that a variant of the phrase occurs in the paragraph; inferring the ML-based model on a hierarchical set of sequences obtained from the paragraph to compute an intermediate maximum vector similarity value; and training the ML-based model such that the intermediate maximum vector similarity value corresponds to the occurrence of a variant of the phrase in the paragraph, as represented by the paragraph annotation data element.
  20. A method of identifying, by at least one processor, an occurrence of a semantic variant of a phrase in a paragraph, the method comprising: computing a phrase embedding vector representing a semantic meaning of the phrase, based on a text representation of the phrase; receiving a text representation of the paragraph, the text representation comprising a plurality of n-grams; extracting a plurality of sequences of n-grams from the text representation of the paragraph; computing, for each sequence, a corresponding sequence embedding vector representing a semantic meaning of the sequence; computing, for one or more sequence embedding vectors, a similarity to the phrase embedding vector; and determining, based on the computed similarity, a sequence as a semantic variant of the phrase.
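The training signals of claims 8 through 11 can be sketched numerically. The claims call only for "a value representing a difference", so the squared-error form below, and cosine as the vector similarity, are illustrative assumptions rather than the patent's specification.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def pairwise_loss(emb_1, emb_2, annotated_sim):
    """Claims 8/9: penalize the gap between the model's vector similarity
    and a human-annotated similarity in [0, 1]. Squared error is an
    assumption; the claims require only a value representing the difference."""
    return (cosine(emb_1, emb_2) - annotated_sim) ** 2

def paragraph_loss(max_similarity, phrase_occurs):
    """Claim 10 (weak supervision): the annotation states only whether a
    variant occurs somewhere in the paragraph, with no position information
    (claim 11). Push the maximum similarity toward 1 for positive paragraphs
    and toward 0 for negative ones."""
    target = 1.0 if phrase_occurs else 0.0
    return (max_similarity - target) ** 2

# Identical vectors annotated as identical incur zero loss.
print(pairwise_loss([1.0, 0.0], [1.0, 0.0], 1.0))       # 0.0
print(round(paragraph_loss(0.9, True), 4))              # 0.01
```

Note that because only the maximum similarity over the hierarchical sets enters `paragraph_loss`, the gradient signal reaches the model without per-position labels, which is what makes the paragraph-level annotation of claim 11 sufficient.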

Description

System and method for identifying variants of phrases in text paragraphs

Cross-Reference to Related Applications

The present application claims priority from U.S. patent application Ser. No. 18/378,509, entitled "A SYSTEM AND METHOD OF IDENTIFYING A VARIATION OF A PHRASE IN A TEXTUAL PASSAGE", filed on October 10, 2023.

Technical Field

The present invention relates generally to automatic text analysis. More particularly, the present invention relates to identifying variants of phrases in text paragraphs.

Background

Contact centers conduct a large number of conversations with various customers every day. A manager or analyst of the contact center may wish to analyze these calls to gain insight into, and improve, the quality of the service provided by the company or by the contact center itself. To support such analysis, a contact center system may expose certain functionality to an "administrator" role. A particularly useful function is finding the subset of all conversations that deal with a specific topic of interest. To automatically determine which topics a given dialog covers, those topics must first be defined. A useful way to define a topic is as a customizable set of typical phrases that may appear in a dialog involving the topic, where a "phrase" is one or more consecutive words. Detecting an exact match of a phrase in the dialog text is simple; compiling a list of all possible variants of a phrase, however, is practically intractable.
Thus, such a system should efficiently generalize the identification of semantically similar phrase variants (e.g., "get me a manager" should be considered similar to "transfer me to a supervisor") without requiring the user to anticipate all possible formulations in advance. Currently available search systems typically rely on lexical matching to identify semantic similarity. Some conventional approaches treat words as similar only if they are identical or share the same lemmatized or stemmed form. Some methods employ generalization techniques, such as applying different weights to different words (e.g., by computing term frequency-inverse document frequency (TF-IDF)) to evaluate the relevance of a word within a document relative to a document collection (corpus). Such a search system may determine that a span of text constitutes an occurrence of a searched phrase if the examined text and the phrase of interest share a sufficient number of significant words. Variations of such methods often suffer from low recall, because (i) words are frequently replaced by synonyms, and (ii) the same concept can be expressed in many different ways. Thus, there is a need to find semantically similar variants of a phrase.

Disclosure of the Invention

Neural network (NN) based methods provide semantic distance metrics that enable a system to quantify the similarity of similar and related words (e.g., "manager" is similar to "supervisor"), but they introduce different challenges and limitations. NN-based approaches for detecting phrase similarity fall into two branches. Cross-encoder: in such implementations, two phrases are concatenated and fed as input to a pre-trained NN model for binary classification. This technique is computationally intensive, because it requires a forward pass for every pair of phrases whose similarity is to be determined.
This makes the approach unsuitable for use cases such as call centers, where a user may enter many phrases to search for and the number of sentences in a conversation may keep growing, resulting in a prohibitively large number of combinations. Bi-encoder: this method represents all phrases as embedding vectors of the same size, produced by a pre-trained model. Phrase similarity can then be computed as the cosine distance between the vectors. The technique is designed for phrases of relatively similar length and does not extend straightforwardly to searching within long text such as conversations. A simple and popular approach would divide the text into sentences or similar units and then generate an embedding for each, but this does not produce satisfactory results in the natural-conversation domain, owing to the lengthy, non-canonical sentences of natural conversation. In addition, the bi-encoder derives its embedding from all words in the input sequence and therefore cannot capture higher-resolution nuances. In short, the presence of irrelevant words in the input sequence forces a low similarity threshold, which in turn limits the ability to capture important differences. For example, in the context of a call center, phrases such as "I called you twice earlier"
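The computational contrast drawn above between the two encoder families can be made concrete by counting model forward passes. The workload sizes below are hypothetical, chosen only for illustration.

```python
def forward_passes(num_phrases: int, num_sequences: int) -> tuple[int, int]:
    """Cross-encoder: one forward pass per (phrase, sequence) pair.
    Bi-encoder: each text is embedded exactly once, and the subsequent
    vector comparisons are cheap relative to a model forward pass."""
    cross_encoder = num_phrases * num_sequences
    bi_encoder = num_phrases + num_sequences
    return cross_encoder, bi_encoder

# Hypothetical workload: 200 search phrases against 5,000 conversation sequences.
print(forward_passes(200, 5000))  # (1000000, 5200)
```

The multiplicative cost of the cross-encoder is what the disclosure identifies as prohibitive when both the phrase list and the conversation volume keep growing; the bi-encoder's additive cost motivates the embedding-comparison design of the claims.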