US-12626158-B1 - Automated contribution analysis for question answering
Abstract
This disclosure describes techniques and an architecture that provide automated contribution analysis for "why question"-style NLQ answering, e.g., "why is revenue down in North America in Q1 2022?" In particular, the techniques described herein combine multiple signals, including, for example, the frequency of use of combinations of dimensions in previous NLQs (warm start), statistical information about columns (e.g., entropy), correlation/co-occurrence between pairs of dimension columns, and correlation between dimensions and dates. This information is used with a set of heuristics and rules to pick the best set of dimensions as contributing factors for a particular metric over a particular time period, and to present an automatic contribution analysis to users to give them insight into their data.
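As a rough illustration of the signal-combination idea described in the abstract, the sketch below scores each dimension column by combining the entropy of its value distribution with how often the column was used in previous NLQs, then ranks the columns. The weighting scheme, the normalization, and the `usage_counts` input are illustrative assumptions; the patent does not disclose exact formulas.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (bits) of a column's value distribution."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def score_dimensions(table, usage_counts, w_entropy=0.5, w_usage=0.5):
    """Combine per-column entropy with prior-usage frequency into one score.

    table: dict of column name -> list of values
    usage_counts: dict of column name -> times used in prior NLQs/analyses
    Returns (column, score) pairs, highest-scoring dimension first.
    """
    entropies = {col: shannon_entropy(vals) for col, vals in table.items()}
    max_h = max(entropies.values()) or 1.0   # avoid division by zero
    max_u = max(usage_counts.values()) or 1
    scores = {
        col: w_entropy * (entropies[col] / max_h)
             + w_usage * (usage_counts.get(col, 0) / max_u)
        for col in table
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical dataset: "region" is informative, "order_id" is a near-unique
# identifier, and "currency" is constant.
table = {
    "region": ["NA", "EU", "NA", "APAC", "EU", "NA"],
    "order_id": ["a1", "a2", "a3", "a4", "a5", "a6"],
    "currency": ["USD"] * 6,
}
usage = {"region": 12, "currency": 1, "order_id": 0}
ranked = score_dimensions(table, usage)
```

In this toy input, the warm-start usage signal lifts `region` above the high-entropy but analytically useless `order_id` column, which is the kind of behavior the combined heuristics are meant to produce.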
Inventors
- Wojciech Aleksander Wilk
- Shannon Kalisky
- Rishav Chakravarti
- William Michael Siler
- Stephen Michael Ash
- Rajesh Patel
- Joshua Noah Malters
- Gregory David Adams
- Jose Kunnackal John
Assignees
- AMAZON TECHNOLOGIES, INC.
Dates
- Publication Date: 2026-05-12
- Application Date: 2022-11-28
Claims (20)
- 1 . A computer-implemented method comprising: providing a dataset arranged in tabular form comprising rows and columns, wherein each column has a name representing a dimension; identifying a topic related to a description of the dataset based on user input; based on the topic, triggering a contributing dimension recommendation workflow for a contribution analysis to determine ranks of one or more dimensions of the dataset that contribute to answering why-type natural language questions (NLQs), wherein the contribution dimension recommendation workflow comprises: analyzing, using a knowledge discovery in databases (KDD) application programming interface (API), one or more heuristics comprising scanning visuals displayed to users related to the dataset, scanning previously asked questions, computing entropy of a distribution of values for a plurality of dimensions of the dataset, previously run anomaly detections, co-occurrence between measures and the dimensions using logistic regression, and frequency of use of the dimensions in previous contribution analyses; based on the analyzing, scoring each dimension; based on each score, ranking each dimension; and based on ranking each dimension, recommending the one or more dimensions for use in answering why-type NLQs; receiving, from a user, a why-type NLQ; based on intent representation (IR) with respect to the why-type NLQ, selecting a metric related to the why-type NLQ; based on the one or more dimensions and the metric, searching the dataset for values related to the metric; and based on the searching, providing results to the user with respect to the why-type NLQ.
- 2 . The computer-implemented method of claim 1 , wherein the metric has an aggregation of values that is one of sum, average, or count.
- 3 . The computer-implemented method of claim 1 , further comprising: excluding, from the one or more dimensions, dimensions having a cardinality that is greater than or equal to a first threshold of sample size from the dataset and that is less than a second threshold that is less than the first threshold.
- 4 . The computer-implemented method of claim 1 , further comprising: determining strongly-correlated dimensions with a co-occurrence greater than or equal to a threshold with respect to another dimension; based on the strongly-correlated dimensions, determining a set of dimensions with a cardinality that are within a cardinality range; selecting one dimension of the set of dimensions with a highest score; and adding values from the one dimension to the results.
- 5 . The computer-implemented method of claim 1 , further comprising: determining strongly-correlated dimensions with a co-occurrence greater than or equal to a threshold with respect to another dimension; based on the strongly-correlated dimensions, determining a set of dimensions with a cardinality that are within a cardinality range; selecting one dimension of the set of dimensions with a highest score; and adding values from the one dimension to the results.
- 6 . The computer-implemented method of claim 1 , wherein the dimensions are columns of tables.
- 7 . A computer-implemented method comprising: based at least in part on a topic related to a dataset, triggering a contribution dimension recommendation workflow for a contribution analysis to determine ranks of one or more dimensions of the dataset that contribute to answering why-type natural language questions (NLQs), wherein the contribution dimension recommendation workflow comprises: analyzing one or more heuristics; ranking individual dimensions of the dataset; and based at least in part on ranking the individual dimensions, recommending the one or more dimensions for use in answering why-type NLQs; receiving, from a user, a why-type NLQ; based at least in part on intent representation (IR) with respect to the why-type NLQ, selecting a metric related to the why-type NLQ; based at least in part on the one or more dimensions and the metric, searching the dataset for one or more values related to the metric; and based at least in part on the searching, providing results to the user with respect to the why-type NLQ.
- 8 . The computer-implemented method of claim 7 , wherein the metric has an aggregation of values that is one of sum, average, or count.
- 9 . The computer-implemented method of claim 7 , further comprising: if a dimension of the one or more dimensions includes a filter, eliminating the dimension from the one or more dimensions.
- 10 . The computer-implemented method of claim 7 , further comprising: excluding, from the one or more dimensions, dimensions with very high cardinality more than or equal to 95% of sample size from the dataset and very low (0, 1) cardinality.
- 11 . The computer-implemented method of claim 7 , wherein the heuristics comprise one or more heuristics comprising scanning visuals displayed to users related to the dataset, scanning previously asked questions, computing entropy of a distribution of values for individual dimensions of the dataset, previously run anomaly detections, co-occurrence between measures and dimensions using logistic regression, and frequency of use of the dimensions in previous contribution analyses.
- 12 . The computer-implemented method of claim 11 , wherein the contributing dimension recommendation workflow for the contribution analysis is performed by a knowledge discovery in databases (KDD) application programming interface (API).
- 13 . The computer-implemented method of claim 7 , wherein at least part of the contributing dimension recommendation workflow for the contribution analysis occurs during creation of the topic.
- 14 . The computer-implemented method of claim 13 , wherein at least part of the contributing dimension recommendation workflow for the contribution analysis occurs offline during creation of the topic.
- 15 . One or more computer-readable media storing computer-executable instructions that, when executed, cause one or more processors to perform operations comprising: based at least in part on a topic related to a dataset, triggering a contributing dimension recommendation workflow for a contribution analysis to determine ranks of one or more dimensions of the dataset that contribute to answering why-type natural language questions (NLQs), wherein the dimension recommendation workflow comprises: analyzing one or more heuristics; ranking individual dimensions of the dataset; and based at least in part on ranking the individual dimensions, recommending the one or more dimensions for use in answering why-type NLQs; receiving, from a user, a why-type NLQ; based at least in part on intent representation (IR) with respect to the why-type NLQ, selecting a metric related to the why-type NLQ; based at least in part on the one or more dimensions and the metric, searching the dataset for one or more values related to the metric; and based at least in part on the searching, providing results to the user with respect to the why-type NLQ.
- 16 . The one or more computer-readable media of claim 15 , wherein the metric has an aggregation of values that is one of sum, average, or count.
- 17 . The one or more computer-readable media of claim 15 , wherein the operations further comprise: if a dimension of the one or more dimensions includes a filter, eliminating the dimension from the one or more dimensions.
- 18 . The one or more computer-readable media of claim 15 , wherein the heuristics comprise one or more heuristics comprising scanning visuals displayed to users related to the dataset, scanning previously asked questions, computing entropy of a distribution of values for the individual dimensions of the dataset, previously run anomaly detections, co-occurrence between measures and dimensions using logistic regression, and frequency of use of the dimensions in previous contribution analyses.
- 19 . The one or more computer-readable media of claim 18 , wherein the contributing dimension recommendation workflow for the contribution analysis is performed by a knowledge discovery in databases (KDD) application programming interface (API).
- 20 . The one or more computer-readable media of claim 19 , wherein: at least part of the contributing dimension recommendation workflow for the contribution analysis occurs during creation of the topic; and at least part of the contributing dimension recommendation workflow for the contribution analysis occurs offline during creation of the topic.
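The dimension-filtering rules recited in claims 3-5 and 10 can be sketched as follows: exclude dimensions whose cardinality is at or above roughly 95% of the sample size (near-unique identifiers) or at or below 1 (constants), and among strongly co-occurring dimensions keep only the highest-scored one. The 0.9 co-occurrence threshold and the `frozenset`-keyed co-occurrence table are illustrative assumptions, not values given in the claims.

```python
def filter_by_cardinality(table, sample_size, high_frac=0.95):
    """Drop very-high- and very-low-cardinality dimensions (cf. claim 10).

    table: dict of column name -> list of values
    """
    keep = []
    for col, vals in table.items():
        card = len(set(vals))
        if card <= 1 or card >= high_frac * sample_size:
            continue  # constant column, or near-unique identifier
        keep.append(col)
    return keep

def dedupe_correlated(dims, scores, cooccurrence, threshold=0.9):
    """Among strongly co-occurring dimensions, keep the highest-scored
    one (cf. claims 4-5).

    cooccurrence: dict mapping frozenset({dim_a, dim_b}) -> co-occurrence
    score in [0, 1]; pairs at or above `threshold` count as duplicates.
    """
    kept = []
    for dim in sorted(dims, key=lambda d: scores[d], reverse=True):
        if all(cooccurrence.get(frozenset((dim, k)), 0.0) < threshold
               for k in kept):
            kept.append(dim)
    return kept

# Hypothetical dataset of 6 rows: "order_id" is unique and "status" is
# constant, so both should be filtered out; "country" and "city" co-occur
# strongly, so only the higher-scored "country" should survive deduping.
table = {
    "country": ["US", "US", "DE", "DE", "FR", "US"],
    "city": ["NYC", "SEA", "BER", "MUC", "PAR", "NYC"],
    "order_id": ["o1", "o2", "o3", "o4", "o5", "o6"],
    "status": ["ok"] * 6,
}
candidates = filter_by_cardinality(table, sample_size=6)
final = dedupe_correlated(
    candidates,
    scores={"country": 0.9, "city": 0.7},
    cooccurrence={frozenset(("country", "city")): 0.95},
)
```

Filtering before deduplication keeps the correlation pass cheap, since co-occurrence only needs to be evaluated over the surviving candidate pairs.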
Description
BACKGROUND Service provider networks may provide cloud-based computing services that may include a cloud-scale business intelligence (BI) service. Such a cloud-scale BI service may be used to deliver easy-to-understand insights to members of various teams and groups, no matter where the individuals of those teams and groups are located. Such cloud-scale BI services connect to business data in the cloud and may combine data from many different sources. In a single data dashboard, the BI service may include data from the cloud-based computing service, third-party data, big data, spreadsheet data, software-as-a-service (SaaS) data, business-to-business (B2B) data, etc. The data within the cloud-scale BI service is generally in tabular form and may be searched and presented in various dashboard forms using keyword searches. Currently, cloud-scale BI services do not adequately support natural language question (NLQ) requests. The BI service may be used to allow BI analysts and engineers to collect data within their data warehouses and data silos across data links to produce data dashboards and reports to be presented to people without technical experience and/or understanding, e.g., business people. By allowing NLQs, such people may ask questions and curate how they view data. However, interpreting people's intentions based on what they type, and translating that intent onto the physical data that is available regardless of the underlying schemas, can be difficult. Thus, for NLQs it is necessary to transfer semantics in order to understand the underlying data. This results in a semi-automated process of transforming arbitrary, disorganized data schemas and the corresponding data into enough of a schematic representation of what is in the data that people may access the data using NLQs. If the context around an NLQ is missing, it is difficult to determine the domain needed to answer the NLQ, since NLQs are often worded in a way that a non-technical person would understand.
Thus, it is difficult to access the data with NLQs. Additionally, there may be limits as to which people, groups, teams, etc., may access some of the data. Thus, when an NLQ is presented, an answer should not be provided to a person, group, or team that does not have access to that answer. For example, if a person types in "secret project" and the answer comes back with "secret project XYZ," then the person now knows that secret project XYZ exists. If this person is not authorized to know of the existence of secret project XYZ, then this is a problem. This can be an important aspect of BI services in that organizations providing services may eventually embed an NLQ service feature within their BI service and then provide the BI service to multiple organizations. If a person at one organization types in an NLQ and an answer comes back related to a different organization, then a breach of privacy and security may result. Furthermore, with such datasets, it can be difficult to know what to analyze and/or consider in the dataset in order to answer an NLQ. Interpreting a user's intent when analyzing an NLQ to obtain an answer can be difficult. In order to answer the NLQ, it is desirable to provide both a quantitative and a qualitative answer, and not just a large amount of numbers. Also, providing an NLQ search feature within a cloud-scale BI service can be difficult because users are generally comfortable with keyword searching. Thus, it may be necessary to teach people how to obtain insights from the data. With an NLQ, this can depend on the user's inference. Users may struggle to form proper NLQs, since the users, e.g., readers, typically are not familiar with what is contained within the data. Authors or administrators of the data generally are familiar with what is included in the data, since authors and administrators are generally the ones who input and initially organize the data within the cloud-scale BI service.
BRIEF DESCRIPTION OF THE DRAWINGS The detailed description is set forth below with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items. The systems depicted in the accompanying figures are not to scale, and components within the figures may be depicted not to scale with each other.
FIG. 1 schematically illustrates a system-architecture diagram of an example service provider network that includes a business intelligence service and an NLQ query service within the service provider network for verifying and validating documents associated with establishing a business account with the service provider network.
FIG. 2 schematically illustrates an example of some of the components of a model pipeline for a query execution process for NLQ processing within the NLQ query service of FIG. 1.
FIG. 3 schematically illustrates an