US-12626013-B2 - Privacy preserving federated query engine


Abstract

A federated query engine system and method for multiple datasets is enhanced with privacy preserving features. It may, for example, limit the movement of data from one or more of the datasets being accessed. It may use cryptographic long-term keys, enabling fuzzy table joins that do not require a comparison of the plaintext column values. The query plan may leverage the particular infrastructure of the storage system that houses each of the datasets. The query engine receives a standard SQL query, translates the query into a logical plan for performing the query across the multiple datasets, converts the logical plan into physical plans that are specific to the implementational architecture of the multiple datasets, and sends these physical plans to SQL workers located near the data warehouses housing each dataset.
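The last stage described above, in which a worker turns a physical plan into the dialect of its own warehouse, can be illustrated with a small sketch. This is not the patented implementation: the `PhysicalCount` node, the `render` function, and the table and column names are hypothetical, and the only dialect difference modeled is identifier quoting (double quotes in Snowflake, backticks in BigQuery).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalCount:
    """Hypothetical physical plan node: distinct count of one column per group."""
    table: str
    group_column: str
    counted_column: str


# Identifier quote character per dialect (the only difference modeled here).
QUOTES = {"snowflake": '"', "bigquery": "`"}


def render(plan: PhysicalCount, dialect: str) -> str:
    """Translate the plan node into SQL text for the given warehouse dialect."""
    q = QUOTES[dialect]

    def ident(name: str) -> str:
        return f"{q}{name}{q}"

    return (
        f"SELECT {ident(plan.group_column)}, "
        f"COUNT(DISTINCT {ident(plan.counted_column)}) AS n "
        f"FROM {ident(plan.table)} "
        f"GROUP BY {ident(plan.group_column)}"
    )


plan = PhysicalCount(table="sales", group_column="region", counted_column="user_id")
snowflake_sql = render(plan, "snowflake")
bigquery_sql = render(plan, "bigquery")
```

The same plan node yields two query strings, one per worker, so the coordinator never needs to know which dialect each warehouse speaks.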

Inventors

Chi Lang Ngo
Maciej Makowski
Piotr Gabryanczyk
David Gilmore
Isaac Hales

Assignees

LIVERAMP, INC.

Dates

Publication Date: 20260512
Application Date: 20221115

Claims (18)

1. A federated query method, comprising the steps of: receiving a query statement, a schema for each of a plurality of datasets, and at least one privacy policy at a query engine, wherein each of the plurality of datasets comprises a plurality of rows and a plurality of columns; parsing the query statement into a structured form; based on the privacy policy, performing a structural transformation of the structured form of the query statement to comply with a set of privacy requirements in the privacy policy to produce a logical query plan, wherein the privacy policy comprises privacy limitations that are different between different datasets in the plurality of datasets, and wherein the structural transformation comprises applying long-term keys utilizing bloom-filter based cryptography that enables fuzzy matching in order to enable privacy-preserving table joins without requiring comparison of plaintext column values, and wherein the structural transformation comprises modifying the logical query plan to calculate counts of distinct rows contributing to aggregation groups, and the rows not meeting privacy thresholds are excluded; based on the schema for each of the plurality of datasets and the logical query plan, generating at least one physical query plan for a query at each of the plurality of datasets; and at each of a plurality of worker nodes, each of which corresponds to one of a plurality of data warehouses each housing one of the plurality of datasets, transforming the physical query plan into a query dialect appropriate to the dataset at the data warehouse that each of the plurality of worker nodes corresponds to in order to produce a translated query.
2. The federated query method of claim 1, wherein the step of parsing the query statement into a structured form comprises the step of parsing the query statement into a tree structure.
3. The federated query method of claim 2, wherein the step of parsing the query statement into a tree structure comprises the step of parsing the query statement into an abstract syntax tree.
4. The federated query method of claim 1, further comprising the step of receiving a location for each of the plurality of datasets.
5. The federated query method of claim 1, wherein the at least one physical plan comprises a plurality of physical plans.
6. The federated query method of claim 5, wherein each of the plurality of physical plans is applied to one of the plurality of datasets.
7. The federated query method of claim 5, further comprising the step of choosing a best physical plan from the plurality of physical plans.
8. The federated query method of claim 1, further comprising the step of running each of the translated queries against one of the plurality of datasets.
9. The federated query method of claim 8, further comprising the step of fetching the results of running each of the translated queries against each of the plurality of datasets back to the query engine.
10. The federated query method of claim 9, wherein the query engine comprises a coordinator, and further comprising the step of fetching the results of running each of the translated queries against each of the plurality of datasets back to the coordinator.
11. The federated query method of claim 10, further comprising the step of aggregating the results of running each of the translated queries against each of the plurality of datasets at the coordinator.
12. The federated query method of claim 11, further comprising the step of applying a set of additional privacy constraints to the aggregated results.
13. The federated query method of claim 1, wherein the privacy policy comprises a query threshold value.
14. The federated query method of claim 1, wherein the privacy policy comprises a restriction on data movement of a portion of the data in at least one of the plurality of datasets.
15. The federated query method of claim 14, wherein the portion of the data in at least one of the plurality of datasets is a column in at least one of the plurality of datasets.
16. The federated query method of claim 1, wherein the privacy policy comprises application of data warehouse native privacy-enhancing features between joins of at least two of the plurality of datasets.
17. The federated query method of claim 1, wherein the step of performing a structural transformation of the structured form of the query statement to comply with a set of privacy requirements in the privacy policy to produce a logical query plan comprises the step of suppressing rows in the plurality of datasets below a threshold value.
18. The federated query method of claim 17, wherein the threshold value varies between at least two of the datasets in the plurality of datasets.
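
The threshold behavior recited in claims 1, 13, 17 and 18 (counting the distinct rows that contribute to each aggregation group and excluding groups that fall below a privacy threshold, with thresholds that may differ per dataset) can be sketched as follows. This is an illustrative reading only, not the claimed implementation: the function name, the dataset names, the threshold values, and the use of a `user_id` column as the notion of a distinct contributor are all assumptions.

```python
from collections import defaultdict


def aggregate_with_threshold(rows, group_key, id_key, threshold):
    """Count distinct contributors per group; drop groups below the threshold."""
    contributors = defaultdict(set)
    for row in rows:
        contributors[row[group_key]].add(row[id_key])
    return {
        group: len(ids)
        for group, ids in contributors.items()
        if len(ids) >= threshold
    }


# Hypothetical per-dataset thresholds: the value may vary between datasets.
THRESHOLDS = {"retailer": 3, "publisher": 2}

retailer_rows = [
    {"user_id": 1, "region": "north"},
    {"user_id": 2, "region": "north"},
    {"user_id": 3, "region": "north"},
    {"user_id": 3, "region": "north"},  # repeat row from the same contributor
    {"user_id": 4, "region": "south"},
]

result = aggregate_with_threshold(
    retailer_rows, "region", "user_id", THRESHOLDS["retailer"]
)
```

Here the "south" group has a single contributor and is suppressed, while the repeated row for user 3 does not inflate the "north" count.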

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. provisional patent application No. 63/279,867, filed on Nov. 16, 2021. Such application is incorporated herein by reference in its entirety.

BACKGROUND

Modern enterprises often have large datasets stored in data warehouses hosted on different infrastructure and/or cloud storage environments. Enterprises nevertheless often wish to collaborate with each other on these datasets. One recent approach to solving this problem is the federated query engine. In a typical federated query engine, data is extracted from multiple sources remote from each other, such as databases and data warehouses, whether found on local infrastructure or in cloud storage. Structured Query Language (SQL) scripts are written that fetch data from across these sources and join tables from different datasets. Examples of commercial products include Amazon Redshift and Athena used in the AWS cloud environment, as well as Google BigQuery used in the Google Cloud environment.

These systems avoid the long, tedious extract, transform, load (ETL) processes required to bring data together in a shared format in order to run queries, because the data remains in place across the different data sources. A single SQL-like query can be made that pulls data across these disparate sources. Because the federated query systems perform the necessary translation, the user need not know the specific query or data language for each dataset; automated conversion allows anyone to perform queries across all of the data sources. Trino, Spark, and Dremio are examples of platforms providing federated query support.

A limitation of federated query systems is that they do not include a means of controlling privacy independently across the multiple datasets being accessed. Individual datasets may, however, have different privacy requirements.
These privacy requirements may be dictated by the organization that keeps the data, by the nature of the data itself, or by the laws and regulations of the jurisdiction where the data was collected and/or stored. Thus a system for providing privacy protections as a part of a federated query system is desired.

References mentioned in this background section are not admitted to be prior art with respect to the present invention.

SUMMARY

The present invention is directed to a federated query engine system and method with privacy preserving features. In certain embodiments, the federated query engine may limit the movement of data from one or more of the datasets being accessed. For example, the owner of a particular dataset may specify that a certain column of data within the dataset may never leave its own network.

Also in certain embodiments, privacy-enhancing technologies may be inserted into the query plan developed by the federated query engine to enable privacy-preserving table joins. For example, this may be performed by using cryptographic long-term keys, a Bloom-filter-based cryptographic approach, thereby enabling fuzzy table joins that do not require a comparison of the plaintext column values.

Also in certain embodiments, the query plan developed by the federated query engine may leverage the particular infrastructure of the storage system that houses each of the datasets. For example, if there are multiple datasets stored in Snowflake accounts, the plan may take advantage of the native Snowflake capability for enabling secure data sharing, while recognizing that this will not be part of the plan for joins between Snowflake and non-Snowflake dataset environments.

By having awareness of both data location (for federation) and privacy constraints of each of the underlying datasets, the federated query engine system according to embodiments of the invention can rewrite or generate an optimal privacy-compliant execution plan.
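
The Bloom-filter-based long-term key idea mentioned above can be sketched in a few lines. This is a generic illustration of the technique, not the encoding used by the invention: the filter size, the number of hash functions, the use of character bigrams, the HMAC-SHA256 construction, the shared secret, and the Dice-coefficient comparison are all assumptions chosen for brevity. Each party encodes its join column locally, and only the encodings are compared.

```python
import hashlib
import hmac

FILTER_BITS = 256  # assumed Bloom filter length
NUM_HASHES = 4     # assumed number of keyed hash functions per bigram


def bigrams(value: str) -> set:
    """Split a normalized, padded string into its set of character bigrams."""
    padded = f"_{value.strip().lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def encode(value: str, secret: bytes) -> int:
    """Encode a value as a Bloom filter, held as an int bitmask."""
    bits = 0
    for gram in bigrams(value):
        for i in range(NUM_HASHES):
            digest = hmac.new(secret, f"{i}:{gram}".encode(), hashlib.sha256).digest()
            bits |= 1 << (int.from_bytes(digest[:8], "big") % FILTER_BITS)
    return bits


def dice(a: int, b: int) -> float:
    """Dice coefficient of two bitmasks: 1.0 for identical, near 0 for unrelated."""
    total = bin(a).count("1") + bin(b).count("1")
    return 2 * bin(a & b).count("1") / total if total else 0.0


SECRET = b"shared-secret-known-to-both-parties"  # hypothetical key

left = encode("Jonathan Smith", SECRET)
near = encode("Jonathon Smith", SECRET)   # one-letter spelling variant
other = encode("Maria Garcia", SECRET)

similar_score = dice(left, near)
different_score = dice(left, other)
```

A join condition would then accept a pair when the score exceeds a chosen cutoff, so a spelling variant can still match even though neither side reveals the plaintext value.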
These and other features, objects and advantages of the present invention will become better understood from a consideration of the following detailed description of the preferred embodiments and appended claims in conjunction with the drawings as described following.

DRAWINGS

FIG. 1 is an architectural diagram of a system according to an embodiment of the present invention.

FIG. 2 is a process flow diagram for a federated SQL coordinator according to an embodiment of the present invention.

DETAILED DESCRIPTION

Before the present invention is described in further detail, it should be understood that the invention is not limited to the particular embodiments described, and that the terms used in describing the particular embodiments are for the purpose of describing those particular embodiments only, and are not intended to be limiting, since the scope of the present invention will be limited only by the claims.

With reference to FIG. 2, an overview of the flow for use of an embodiment of the present invention as a privacy-preserving federated SQL engine may work as follows. At SQL text step 10, the engine takes as its