US-12625872-B2 - Generating overlap queries on a database system

US12625872B2US 12625872 B2US12625872 B2US 12625872B2US-12625872-B2

Abstract

An advanced system for refining overlap queries in a database system based on user feedback. The system monitors interactions of a first user with a first dataset on the database system, where the first dataset is associated with the first user. Feedback regarding the quality of a results dataset, generated from an executed overlap query, is received from the first user. This feedback informs the generation of a similarity score dataset that enhances the creation of new overlap queries. These new overlap queries are designed to output refined overlap datasets between the first dataset and a second dataset associated with a second user. A new joined dataset is generated by executing these overlap queries, comprising data from both the first and second datasets. A new results dataset is generated, providing the first user with refined recommendations based on additional feedback.

Inventors

Matthew J. Glickman
Orestis KOSTAKIS
Justin Langseth

Assignees

SNOWFLAKE INC.

Dates

Publication Date: 20260512
Application Date: 20240509

Claims (20)

1 . A method comprising: monitoring, by at least one hardware processor, a first user interaction with a first dataset on a database system, the first dataset being associated with a first user; receiving, from the first user, feedback regarding a quality of a results dataset generated from an executed overlap query, the results dataset comprising a joined dataset of data from the first dataset of the first user and data from a second dataset of a second user; based on the feedback, generating a similarity score dataset indicating a similarity between the first dataset and the second dataset; generating, using the similarity score dataset, a new set of overlap queries specifying join operations between the first dataset and the second dataset; executing the new set of overlap queries to materialize a new joined dataset comprising the data from the first dataset and the data from the second dataset, the materializing comprises persisting the new joined dataset as a physical table indexed and stored in the database system for subsequent reuse; generating a new results dataset by applying the new set of overlap queries to the new joined dataset; and providing, to the first user, the new results dataset comprising a refinement recommendation for the overlap queries based on additional user feedback, the refinement recommendation comprising a ranked set of proposed modifications to at least one join operation associated with the overlap queries.
2 . The method of claim 1 , wherein each overlap query is generated based on non-matching data between the first dataset and a corresponding dataset of the new joined dataset, the non-matching data comprises non-matching columns between the first dataset and the new joined dataset.
3 . The method of claim 2 , wherein generating the new set of overlap queries comprises identifying, from an overlap query repository, an overlap query that is associated with the non-matching data.
4 . The method of claim 1 , further comprising: prior to generating the new joined dataset, executing a data cleaning process on the first dataset and the second dataset to ensure data quality and consistency.
5 . The method of claim 1 , wherein the feedback from the first user is received via an interactive user interface and further comprises: enabling the first user to rate the quality of the results dataset on a predefined scale.
6 . The method of claim 1 , wherein the new set of overlap queries is generated based on a machine learning model that predicts potential utility of overlap datasets based on historical user feedback and query outcomes.
7 . The method of claim 6 , wherein the machine learning model is retrained periodically with new user feedback and results datasets to improve accuracy of the new set of overlap queries.
8 . The method of claim 1 , further comprising: categorizing each dataset in the database system by a semantic type of a data field within each dataset, wherein the semantic type includes at least one of a ZIP code, a date, or a numerical identifier.
9 . The method of claim 8 , wherein generating the similarity score dataset includes comparing the semantic type of the first dataset with the semantic type of the second dataset to identify matching and non-matching data fields.
10 . A system comprising: one or more hardware processors; and one or more computer-readable mediums storing instructions that, when executed by the one or more hardware processors, cause the system to perform operations comprising: monitoring a first user interaction with a first dataset on a database system, the first dataset being associated with a first user; receiving, from the first user, feedback regarding a quality of a results dataset generated from an executed overlap query, the results dataset comprising a joined dataset of data from the first dataset of the first user and data from a second dataset of a second user; based on the feedback, generating a similarity score dataset indicating a similarity between the first dataset and the second dataset; generating, using the similarity score dataset, a new set of overlap queries specifying join operations between the first dataset and the second dataset; executing the new set of overlap queries to materialize a new joined dataset comprising the data from the first dataset and the data from the second dataset, the materializing comprises persisting the new joined dataset as a physical table indexed and stored in the database system for subsequent reuse; generating a new results dataset by applying the new set of overlap queries to the new joined dataset; and providing, to the first user, the new results dataset comprising a refinement recommendation for the overlap queries based on additional user feedback, the refinement recommendation comprising a ranked set of proposed modifications to at least one join operation associated with the overlap queries.
11 . The system of claim 10 , wherein each overlap query is generated based on non- matching data between the first dataset and a corresponding dataset of the new joined dataset, the non-matching data comprises non-matching columns between the first dataset and the new joined dataset.
12 . The system of claim 10 , the operations further comprising: prior to generating the new joined dataset, executing a data cleaning process on the first dataset and the second dataset to ensure data quality and consistency.
13 . The system of claim 10 , wherein the feedback from the first user is received via an interactive user interface and the operations further comprising: enabling the first user to rate the quality of the results dataset on a predefined scale.
14 . The system of claim 10 , wherein the new set of overlap queries is generated based on a machine learning model that predicts potential utility of overlap datasets based on historical user feedback and query outcomes, the machine learning model is retrained periodically with new user feedback and results datasets to improve accuracy of the new set of overlap queries.
15 . The system of claim 10 , the operations further comprising: categorizing each dataset in the database system by a semantic type of a data field within each dataset, wherein the semantic type includes at least one of a ZIP code, a date, or a numerical identifier; and comparing the semantic type of the first dataset with the semantic type of the second dataset to identify matching and non-matching data fields.
16 . A machine-readable storage device embodying instructions that, when executed by a machine, cause the machine to perform operations comprising: monitoring a first user interaction with a first dataset on a database system, the first dataset being associated with a first user; receiving, from the first user, feedback regarding a quality of a results dataset generated from an executed overlap query, the results dataset comprising a joined dataset of data from the first dataset of the first user and data from a second dataset of a second user; based on the feedback, generating a similarity score dataset indicating a similarity between the first dataset and the second dataset; generating, using the similarity score dataset, a new set of overlap queries specifying join operations between the first dataset and the second dataset; executing the new set of overlap queries to materialize a new joined dataset comprising the data from the first dataset and the data from the second dataset, the materializing comprises persisting the new joined dataset as a physical table indexed and stored in the database system for subsequent reuse; generating a new results dataset by applying the new set of overlap queries to the new joined dataset; and providing, to the first user, the new results dataset comprising a refinement recommendation for the overlap queries based on additional user feedback, the refinement recommendation comprising a ranked set of proposed modifications to at least one join operation associated with the overlap queries.
17 . The machine-readable storage device of claim 16 , wherein each overlap query is generated based on non-matching data between the first dataset and a corresponding dataset of the new joined dataset, the non-matching data comprises non-matching columns between the first dataset and the new joined dataset.
18 . The machine-readable storage device of claim 16 , the operations further comprising: prior to generating the new joined dataset, executing a data cleaning process on the first dataset and the second dataset to ensure data quality and consistency.
19 . The machine-readable storage device of claim 16 , wherein the feedback from the first user is received via an interactive user interface and the operations further comprising: enabling the first user to rate the quality of the results dataset on a predefined scale.
20 . The machine-readable storage device of claim 16 , the operations further comprising: categorizing each dataset in the database system by a semantic type of a data field within each dataset, wherein the semantic type includes at least one of a ZIP code, a date, or a numerical identifier; and comparing the semantic type of the first dataset with the semantic type of the second dataset to identify matching and non-matching data fields.

Description

PRIORITY CLAIM This application is a Continuation of U.S. patent application Ser. No. 17/804,434, filed May 27, 2022, the contents of which are hereby incorporated by reference. TECHNICAL FIELD The present disclosure generally relates to efficiently managing combining data in a distributed database. BACKGROUND Databases are used for data storage and access in computing applications. A goal of database storage is to provide enormous sums of information in an organized manner so that it can be accessed, managed, and updated. In a database, data may be organized into rows, columns, and tables. Data from different datasets can be combined to form a combined dataset which can be queried. BRIEF DESCRIPTION OF THE DRAWINGS Various ones of the appended drawings merely illustrate example embodiments of the present disclosure and should not be considered as limiting its scope. FIG. 1 illustrates an example computing environment in which a network-based database system can implement an overlap data system, according to some example embodiments. FIG. 2 is a block diagram illustrating components of a compute service manager, according to some example embodiments. FIG. 3 is a block diagram illustrating components of an execution platform, according to some example embodiments. FIGS. 4A and 4B show data architectures for implementing overlap queries, according to some example embodiments. FIG. 5 shows a flow diagram of a method for implementing overlap queries, according to some example embodiments. FIG. 6 illustrates a diagrammatic representation of a machine in the form of a computer system within which a set of instructions may be executed for causing the machine to perform any one or more of the methodologies discussed herein, in accordance with some embodiments of the present disclosure. DETAILED DESCRIPTION The description that follows includes systems, methods, techniques, instruction sequences, and computing machine program products that embody illustrative embodiments of the disclosure. In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide an understanding of various embodiments of the inventive subject matter. It will be evident, however, to those skilled in the art, that embodiments of the inventive subject matter may be practiced without these specific details. In general, well-known instruction instances, protocols, structures, and techniques are not necessarily shown in detail. In some example embodiments, a distributed database can provide each user controlled access to datasets for other users in a data marketplace or anonymizing data clean room. Data analyst users that work with data in order to extract business value strive to enrich their data by combining other data sources using the shared access provided by the distributed database. For example, a user may browse and search for useful datasets and “join” (e.g., perform a SQL JOIN operation on, followed by a SQL SELECT on) them with their original data that they want to enrich. A goal for users of the shared datasets is that by searching and browsing the available shared datasets, the user will discover data stored in tables that contain fields (columns) similar/identical to those in their original data tables (to be able to join), and that they will be able to get novel information from the non-common fields. However, in a data marketplace with a large number of datasets, finding datasets that are joinable is a cumbersome task. Even if the user comes across a suitable dataset, the user then must expend time, effort, and computational resources to confirm if it is a “joinable” dataset (for example, to confirm that the user's data and the potential target data has a congruent schema, that the user's data and the potential target data will produce matches, and each dataset has key fields that match so that when the datasets are joined there will be matches). To this end, an overlap data system is configured to identify problems to a given user that are solvable using overlap queries on an overlap dataset (e.g., recommend to users the actual problems that they can solve via overlaps, if they were to enrich (e.g., join) their existing data, with some of the suitable datasets that the overlap data system has automatically identified as being congruent datasets with which the user can join their data). As an example, the overlap data system first categorizes the datasets. For instance, for each dataset (e.g., table) in the shared data access database service (e.g., a web-browsable data marketplace of the distributed database), for each column, the overlap data system identifies the semantic type (e.g., while a column-name may be “LOCATION,” the data type is a string or number, but the semantic type is actually a ZIP code). After semantic categorization of the schema (e.g., columns), overlap data system analyzes a given users' existing data (e.g., tables) in a distributed database by extrac