CA-3131157-C - SYSTEM AND METHOD FOR TEXT CATEGORIZATION AND SENTIMENT ANALYSIS
Abstract
A system and method for improved categorization and sentiment analysis. The system receives textual data, such as transcriptions or collated data from a network-enabled service or some other source, segments the textual data into chunks, parses the chunks, and analyzes them using a plurality of techniques and metadata-gathering methods to determine the sentiment of participating individuals concerning entities mentioned in the textual data and to categorize the discussions, for the purpose of taking actions to improve business outcomes.
Inventors
- Jonathan Kershaw
- Ashley Unitt
- Alan McCord
Assignees
- NEWVOICEMEDIA US INC.
Dates
- Publication Date: 20260505
- Application Date: 20200224
- Priority Date: 20190222
Claims (1)
- What is claimed is:
- 1. A system for categorization and sentiment analysis, comprising a computing device running the following software modules: a chunk parser for receiving input text; and breaking the input text into chunks of text comprising words and phrases; a chunk sentiment analyzer which receives the chunks of text from the chunk parser, assigns a sentiment to each chunk of text, and passes each chunk with its assigned sentiment to a deterministic rules engine; the deterministic rules engine which: categorizes each chunk of text into a first set of semantic categories using regular expression rules; and for chunks of text where no regular expression rule is found for categorization into the first set of semantic categories, passes those chunks of text to a semantic similarity engine; the semantic similarity engine which: adds a vector to each chunk of text received from the deterministic rules engine representing the semantic characteristics of that chunk of text; categorizes the chunks of text into a second set of semantic categories based on a threshold semantic distance from one or more category anchor vectors; and for chunks of text where no match is found for categorization into the second set of semantic categories, passes those chunks of text to a semantic cluster discovery engine; a semantic cluster discovery engine for categorizing chunks of text received from the semantic similarity engine into a third set of semantic categories based on their clustering relative to one another, for those chunks of text which do not fall within the threshold distance from any of the one or more category anchor vectors; and a category comparator and integrator for comparing the first, second, and third sets of semantic categories to identify contextual associations between the chunks of text in each semantic category; and calculating the sentiment for the input text based on the contextual associations.
- 2. The system of claim 1, further comprising a sequence reducer and embedder for after sentiment has been calculated on each chunk of text, reducing each chunk of text further into a sequence of words which preserves the order of words from the input text; embedding each input word sequence into a vector according to a chosen sequence embedding model.
- 3. The system of claim 1, further comprising a trend analyzer for as additional input texts are received, analyzing and displaying: the number of and proportion of texts in each category; the growth or decline of categories over time; and an automated management alert when an emerging category grows at or above a threshold rate.
- 4. The system of claim 1, further comprising a supervised machine learning algorithm for analyzing the attributes of the input text, the categories, and calculated sentiment; and predicting a combination of attributes which will result in a given sentiment.
- 5. A method for categorization and sentiment analysis, comprising the steps of: using a chunk parser operating on a computing device comprising a memory and a processor to perform the steps of: receiving an input text; breaking the input text into chunks of text comprising words and phrases; using a chunk sentiment analyzer operating on the computing device to perform the steps of: receiving the chunks of text from the chunk parser; assigning a sentiment to each chunk of text; and passing each chunk with its assigned sentiment to a deterministic rules engine; using the deterministic rules engine operating on the computing device to perform the steps of: categorizing each chunk of text into a first set of semantic categories using regular expression rules; and for chunks of text where no regular expression rule is found for categorization into the first set of semantic categories, passing those chunks of text to a semantic similarity engine; using the semantic similarity engine operating on the computing device to perform the steps of: adding a vector to each chunk of text received from the deterministic rules engine representing the semantic characteristics of that chunk of text; categorizing the chunks of text into a second set of semantic categories based on a threshold semantic distance from one or more category anchor vectors; and for chunks of text where no match is found for categorization into the second set of semantic categories, passing those chunks of text to a semantic cluster discovery engine; using the semantic cluster discovery engine operating on the computing device to perform the steps of: categorizing those chunks of text received from the semantic similarity engine into a third set of semantic categories based on their clustering relative to one another, for chunks of text which do not fall within the threshold distance from any of the one or more category anchor vectors; using a category comparator and integrator operating on the computing device to perform the steps of comparing the first, second, and third sets of semantic categories to identify contextual associations between the chunks of text in each semantic category; and calculating the sentiment for the input text based on the contextual associations.
- 6. The method of claim 5, further comprising the steps of: after sentiment has been calculated on each chunk of text, reducing each chunk of text further into a sequence of words which preserves the order of words from the input text; and embedding each input word sequence into a vector according to a chosen sequence embedding model.
- 7. The method of claim 5, further comprising the steps of: as additional input texts are received, analyzing and displaying: the number of and proportion of texts in each category; the growth or decline of categories over time; and an automated management alert when an emerging category grows at or above a threshold rate.
- 8. The method of claim 5, further comprising the steps of: analyzing the attributes of the input text, the categories, and calculated sentiment using a machine learning algorithm; and predicting a combination of attributes which will result in a given sentiment.
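The three-stage cascade recited in claims 1 and 5 (regular expression rules first, then distance to category anchor vectors, then cluster discovery for whatever remains) can be illustrated with a short Python sketch. It is only a sketch under stated assumptions, not the claimed implementation: the rule, anchor, lexicon, and threshold values are hypothetical, a bag-of-words count vector stands in for a learned embedding, a greedy single-pass grouping stands in for the cluster discovery engine, and the category comparator and integrator is reduced to a per-category sentiment total.

```python
import math
import re
from collections import Counter

# All names, rules, anchors, and thresholds below are hypothetical illustrations.
RULES = {"billing": re.compile(r"\b(invoice|refund|charged?)\b", re.I)}
ANCHORS = {"delivery": "package arrived late shipping delay"}
LEXICON = {"great": 1, "happy": 1, "late": -1, "broken": -1, "wrong": -1}
THRESHOLD = 0.7  # maximum cosine distance for an anchor or cluster match


def chunk(text):
    """Break input text into chunks at sentence-ending punctuation."""
    return [c.strip() for c in re.split(r"[.!?;]+", text) if c.strip()]


def tokens(chunk_text):
    return re.findall(r"[a-z']+", chunk_text.lower())


def sentiment(chunk_text):
    """Sum lexicon scores; a stand-in for a real chunk sentiment analyzer."""
    return sum(LEXICON.get(t, 0) for t in tokens(chunk_text))


def embed(chunk_text):
    """Bag-of-words counts; a stand-in for a learned sequence embedding."""
    return Counter(tokens(chunk_text))


def distance(a, b):
    """Cosine distance between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def categorize(text):
    """Return (chunk, category, sentiment) triples for one input text."""
    anchors = {name: embed(words) for name, words in ANCHORS.items()}
    results, clusters = [], []  # clusters holds one seed vector per discovered group
    for c in chunk(text):
        score = sentiment(c)
        # Stage 1: deterministic regular expression rules.
        matched = next((n for n, rx in RULES.items() if rx.search(c)), None)
        if matched is None:
            # Stage 2: nearest category anchor within the threshold distance.
            vec = embed(c)
            name, dist = min(
                ((n, distance(vec, a)) for n, a in anchors.items()),
                key=lambda pair: pair[1],
            )
            if dist <= THRESHOLD:
                matched = name
            else:
                # Stage 3: greedy grouping of chunks that matched no anchor.
                for i, seed in enumerate(clusters):
                    if distance(vec, seed) <= THRESHOLD:
                        matched = f"cluster-{i}"
                        break
                else:
                    clusters.append(vec)
                    matched = f"cluster-{len(clusters) - 1}"
        results.append((c, matched, score))
    return results


def summarize(results):
    """Total chunk sentiment per category; a simplification of the integrator."""
    totals = Counter()
    for _, category, score in results:
        totals[category] += score
    return dict(totals)
```

Calling `summarize(categorize(text))` on a transcript yields one sentiment total per rule-based, anchor-based, or discovered category. In practice the count vectors would likely be replaced by a trained embedding model and the greedy grouping by a proper clustering algorithm; the control flow between the three stages is the part this sketch is meant to show.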
Description
WO 2020/172649 PCT/US2020/019438
SYSTEM AND METHOD FOR TEXT CATEGORIZATION AND SENTIMENT ANALYSIS
CROSS-REFERENCE TO RELATED APPLICATIONS
- Current application. Date Filed: Herewith. Title: SYSTEM AND METHOD FOR TEXT CATEGORIZATION AND SENTIMENT ANALYSIS
- Is a PCT filing of, and claims priority to: Application No. 16/794,162. Date Filed: Feb. 18, 2020. Title: SYSTEM AND METHOD FOR TEXT CATEGORIZATION AND SENTIMENT ANALYSIS
- which is a continuation of: Application No. 16/283,447 (Patent: 10,565,244). Date Filed: Feb. 22, 2019 (Issue Date: Feb. 18, 2020). Title: SYSTEM AND METHOD FOR TEXT CATEGORIZATION AND SENTIMENT ANALYSIS
BACKGROUND OF THE INVENTION
Field of the Art
[001] The disclosure relates to the field of information processing, and more particularly to the field of analyzing provided text representing conversations for sentiment and performing categorizations that make a response to expressed sentiment actionable.
Discussion of the State of the Art
[002] It is currently commonplace in textual analysis to use regular expressions with dictionaries of words and databases of common or anticipated nouns, to perform simple lookups and pattern-matches to loosely categorize subject matter and sentiment during a conversation or from a text sample provided to a given system. This may be done to analyze the sentiments of people communicating on message boards on the Internet, or to gauge them during text conversations with chatbots online such as for customer service purposes, or this may be done for information collecting purposes for law enforcement and human resources organizations, and even to detect unwanted messages in services such as email and text messaging services as well as the sentiment expressed in conversations with contact center agents on various topics, e.g. relating to sales and service.
[003] While current efforts for computing categorization and sentiment from text may be able to gauge user sentiment with some degree of accuracy some of the time, there is considerable lack of detail and a considerable margin for error in many cases using current simplistic systems. Emails may sometimes be erroneously gauged as spam, texts or messages on social networks and message-boards may be erroneously flagged for moderation or deletion, or their content may be inaccurately gauged for users searching for specific forms of content.
[004] What is needed is a system which will analyze the sentiment of a piece of conversational text with high accuracy (precision and recall), do so within the context of user-defined categories, and monitor the change over time of the distribution of textual data that falls within each category together with its sentiment. Furthermore, a system is needed that can also discover the emergence of new categories automatically without them having to be pre-defined.
SUMMARY OF THE INVENTION
[005] Accordingly, the inventor has conceived, and reduced to practice, a system and method for improved categorization and sentiment analysis.
[006] A system for categorization and sentiment analysis is disclosed, comprising: a chunk parser comprising at least a plurality of programming instructions stored in a memory and operating on at least one processor of a computer, wherein the programmable instructions, when operating on the at least one processor, cause the at least one processor to: receive input in text form; break the text into chunks of text comprising words and phrases; and compute sentiment on the text at the chunk level; and a deterministic rules engine comprising at least a plurality of programming instructions stored in a memory and operating on at least one processor of a computer, wherein the programmable instructions, when operating on the processor, cause the processor to: categorize the text into pre-defined categories using regular expression rules and store the categorization; if no regular expression rule is matched, forward the chunked text to a semantic similarity engine; and a semantic similarity engine comprising at least a plurality of programming instructions stored in a memory and operating on at least one processor of a computer, wherein the programmable instructions, when operating on the at least one processor, cause the at least one processor to: receive chunked text; represent each chunk of text as a vector embedded in a high dimensional space representing semantic characteristics of the chunked text; categorize the chunked text into pre-defined categories using a threshold semantic similarity distance (hypersphere radius) from any of a set of pre-defined anchor word sequences for each category; and if no sufficiently close match is found to any pre-defined category anchor word sequences, forward the chunked text with embedded vector dimensions to a semantic cluster discovery engine; and a semantic cluster discovery engine comprising at least a plurality of programming instructions stored in a memory and operating on at least one processo