US-12619934-B2 - Specialized computing environment for co-analysis of proprietary data
Abstract
A specialized computing environment that includes hardware and data security features to enable competitive organizations to co-analyze proprietary data without revealing the underlying proprietary data to unauthorized users. Proprietary data are stored in volatile memory, which may be automatically erased according to pre-stored criteria. The analysis is performed automatically by a processing unit without human intervention. Analytical results are sanitized (e.g., using data masking) to prevent the analytical result from being traceable to any particular data source. Sanitized analytical results are output without outputting the underlying proprietary data (except to users authorized to validate analytical results). The computing environment is enclosed within a secure enclosure (e.g., a steel box with a lock), does not include any peripheral devices outside the secure enclosure, does not communicate wirelessly, and does not have hardware ports accessible from outside the secure enclosure (except, in some embodiments, a wired connection for a web server).
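The abstract's sanitization step (masking an analytical result so it is not traceable to any particular data source) can be sketched as follows. This is a minimal illustration only: the field names, the salted-hash scheme, and the `sanitize` function are assumptions for exposition, not the patented method.

```python
import hashlib

def sanitize(result, secret="illustrative-salt"):
    """Data masking sketch: replace source-identifying fields in an
    analytical result with opaque tokens so the sanitized result is not
    traceable to a particular proprietary data source."""
    masked = dict(result)
    for field in ("source", "source_id"):  # assumed identifying fields
        if field in masked:
            digest = hashlib.sha256((secret + str(masked[field])).encode()).hexdigest()
            masked[field] = "masked-" + digest[:8]
    return masked

# The non-identifying analytical content passes through unchanged.
sanitized = sanitize({"source": "OrgA", "metric": 12})
```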
Inventors
- David Hartley
- Ophir Frieder
Assignees
- GEORGETOWN UNIVERSITY
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2023-05-09
Claims (19)
- 1 . A system for co-analyzing proprietary data while preventing distribution of the proprietary data, the system comprising: a secure enclosure with a door and a lock; and a computing environment, within the secure enclosure, comprising: non-transitory volatile memory that stores the proprietary data from each of a plurality of proprietary data sources; non-transitory system memory that stores software modules for co-analyzing the proprietary data; and a processing unit that: co-analyzes the proprietary data by executing instructions stored on the non-transitory system memory without human intervention to form an analytical result; sanitizes the analytical result by changing one or more data elements in the analytical result to form a sanitized analytical result and prevent the sanitized analytical result from being traceable to any data type or any of the proprietary data sources; and outputs the sanitized analytical result for transmittal to one of the proprietary data sources; wherein the processing unit co-analyzes the proprietary data by either: coding the proprietary data according to an ontology, populating a multi-dimensional ontology space by adding points in the ontology space that correspond to ontological vectors found in the proprietary data, using an optimization algorithm to identify populated neighborhoods in the ontology space, and identifying one or more hypotheses corresponding to one or more of the populated neighborhoods in the ontology space; or analyzing the proprietary data to construct one or more numerical metrics, identifying a baseline for each of the one or more numerical metrics, receiving additional documents, analyzing the additional documents to identify one or more updated numerical metrics, and identifying one or more updated numerical metrics that deviate from the baseline.
- 2 . The system of claim 1 , wherein the processing unit sanitizes the analytical result using hypothesis obfuscation, providing functionality for the proprietary data sources to attach tags to elements in the proprietary data that trigger a specific cleaning action, providing functionality to encode messages in the proprietary data that are meaningful only in combination with a particular locally-resident data or profile, or using a data masking technique.
- 3 . The system of claim 1 , wherein the processing unit stores the analytical result in the volatile memory.
- 4 . The system of claim 1 , wherein the processing unit stores the sanitized analytical result in the system memory.
- 5 . The system of claim 1 , wherein the processing unit erases the proprietary data stored in the volatile memory according to data vanishing criteria stored in the system memory.
- 6 . The system of claim 1 , wherein the processing unit outputs the sanitized analytical result without outputting the proprietary data to unauthorized users.
- 7 . The system of claim 6 , wherein the processing unit further provides functionality for authorized users to view the proprietary data used by the processing unit to form the sanitized analytical result.
- 8 . The system of claim 7 , wherein the proprietary data from a respective proprietary data source from among the plurality of proprietary data sources includes at least one data field that identifies the respective proprietary data source and the computing environment provides functionality for only authorized users to view the respective proprietary data source stored in the at least one data field.
- 9 . The system of claim 1 , wherein the computing environment provides functionality for the proprietary data sources to transmit the proprietary data for storage in the volatile memory via a web server over the Internet.
- 10 . The system of claim 1 , wherein the computing environment provides functionality for the proprietary data sources to transmit the proprietary data for storage in the volatile memory via a hardware port that is only accessible when the door of the secure enclosure is open.
- 11 . The system of claim 1 , further comprising: an individual input port for each of the proprietary data sources to transmit the proprietary data.
- 12 . The system of claim 11 , further comprising: an encrypted input isolator for each of the plurality of proprietary data sources that enforces one-way data flow.
- 13 . The system of claim 12 , wherein the encrypted input isolators further perform formatting processes such that the proprietary data is stored in the volatile memory using a defined structure and format.
- 14 . The system of claim 13 , wherein: the computing environment provides functionality for the proprietary data sources to transmit encrypted proprietary data; and the processing unit decrypts the encrypted proprietary data.
- 15 . The system of claim 1 , wherein: the processing unit encrypts the sanitized analytical result and outputs the encrypted sanitized analytical result for transmittal to one of the proprietary data sources.
- 16 . The system of claim 15 , further comprising: a hardware adapter for each of the plurality of proprietary data sources that decrypts the encrypted sanitized analytical result.
- 17 . The system of claim 1 , wherein: the computing environment does not include any peripheral input devices or peripheral output devices outside the secure enclosure; the computing environment does not communicate wirelessly when locked in the secure enclosure; and the secure enclosure prevents access to hardware ports of the computing environment when the door is closed.
- 18 . A method of co-analyzing proprietary data while preventing distribution of the proprietary data, the method comprising: receiving the proprietary data from each of a plurality of proprietary data sources and storing the proprietary data in non-transitory volatile memory; co-analyzing the proprietary data by a processing unit executing instructions stored on non-transitory system memory without human intervention to form an analytical result, wherein the proprietary data is co-analyzed by either: coding the proprietary data according to an ontology, populating a multi-dimensional ontology space by adding points in the ontology space that correspond to ontological vectors found in the proprietary data, using an optimization algorithm to identify populated neighborhoods in the ontology space, and identifying one or more hypotheses corresponding to one or more of the populated neighborhoods in the ontology space; or analyzing the proprietary data to construct one or more numerical metrics, identifying a baseline for each of the one or more numerical metrics, receiving additional documents, analyzing the additional documents to identify one or more updated numerical metrics, and identifying one or more updated numerical metrics that deviate from the baseline; sanitizing the analytical result by changing one or more data elements in the analytical result to form a sanitized analytical result and prevent the sanitized analytical result from being traceable to any data type or any of the proprietary data sources; and transmitting the sanitized analytical result to one of the proprietary data sources over a communication channel.
- 19 . A system for co-analyzing proprietary data while preventing distribution of the proprietary data, the system comprising: a secure enclosure with a door and a lock; and a computing environment, within the secure enclosure, comprising: non-transitory volatile memory that stores the proprietary data from each of a plurality of proprietary data sources; an encrypted input isolator for each of the proprietary data sources that: enforces one-way data flow; and performs formatting processes such that the proprietary data is stored in the volatile memory using a defined structure and format; non-transitory system memory that stores software modules for co-analyzing the proprietary data; and a processing unit that: co-analyzes the proprietary data by executing instructions stored on the non-transitory system memory without human intervention to form an analytical result; sanitizes the analytical result by changing one or more data elements in the analytical result to form a sanitized analytical result and prevent the sanitized analytical result from being traceable to any data type or any of the proprietary data sources; and outputs the sanitized analytical result for transmittal to one of the proprietary data sources.
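Claims 1, 18, and 19 each recite a first co-analysis branch: coding proprietary data according to an ontology, populating a multi-dimensional ontology space with points corresponding to ontological vectors found in the data, identifying populated neighborhoods, and deriving hypotheses from them. The sketch below illustrates that idea only; the toy ontology, the unit-grid density check standing in for the claimed "optimization algorithm," and all function names are assumptions, not the claimed implementation.

```python
from collections import Counter, defaultdict

# Illustrative ontology: each concept is one dimension of the ontology space.
ONTOLOGY = ["contamination", "recall", "shipment", "illness"]

def code_document(text):
    """Code a document as an ontological vector (concept counts)."""
    counts = Counter(text.lower().split())
    return tuple(counts.get(term, 0) for term in ONTOLOGY)

def populate_space(documents):
    """Populate the ontology space: one point per document's vector."""
    return [code_document(doc) for doc in documents]

def populated_neighborhoods(points, min_points=2):
    """Find populated neighborhoods: unit-grid cells holding at least
    min_points points (a crude stand-in for a clustering/optimization step)."""
    cells = defaultdict(list)
    for p in points:
        cells[p].append(p)  # the vector itself is the cell key
    return [cell for cell, members in cells.items() if len(members) >= min_points]

def hypotheses(neighborhoods):
    """Name a hypothesis for each neighborhood from its active concepts."""
    return [" + ".join(term for term, c in zip(ONTOLOGY, cell) if c > 0)
            for cell in neighborhoods]
```

For example, two documents mentioning both "contamination" and "recall" map to the same ontology-space point, that cell becomes a populated neighborhood, and the corresponding hypothesis is "contamination + recall".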
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
This is a continuation of U.S. patent application Ser. No. 16/663,547, filed Oct. 25, 2019, the entire contents of which are hereby incorporated by reference.
FEDERAL FUNDING
None.
BACKGROUND
The systematic monitoring of “big data” to gain robust understanding within a particular domain is a pillar of modern commerce, research, security, health care, and other fields. Governments and other organizations seek situational awareness, real-time indications and warnings, and short- to long-term forecasting. If properly analyzed, even seemingly benign, publicly-available open source data can be used to identify leading indicators of events of interest to those governments and organizations. Additionally, organizations may have access to proprietary information that, if properly analyzed, can offer insight into the domain of the organization. However, the total amount of digital information publicly available on global networks is increasing exponentially and cannot be manually reviewed, even by a large group of humans, quickly enough to identify all relevant data for a given subject or project.
The demand for processing large volumes of digital data in real time is particularly heightened in the areas of national security, law enforcement, and intelligence. Agencies faced with ongoing digital and physical threats from various parts of the world are tasked with warning communities before an attack, implementing emergency preparedness, securing borders and transportation arteries, protecting critical infrastructure and key assets, and defending against catastrophic terrorism. Similar demands also exist in other surveillance areas, including natural disasters, humanitarian emergencies, public health events, public opinion, consumer product issues, and morale. The capability to detect potential events early and to monitor such plots continuously before they are carried out is critical.
The data on global networks can potentially give information-seeking organizations all the information they need. The key question is how to effectively and carefully sort and search vast amounts of data. The conventional approach to identifying events of interest is to examine data or streams of data for keywords related to topics of interest. When relevant documents are detected (e.g., by Boolean keyword searches, logistic regression, and/or Bayesian or other classifiers), they are made available to human analysts, who examine the resulting corpus of retrieved material and form interpretations. Another common approach is to monitor a numerical variable (e.g., temperature, rainfall, number of inspection alerts, etc.) for anomalies and, when an anomaly is found or suspected, focus additional scrutiny or possibly undertake an investigation looking for a potential event. While these conventional methods are the norm, they are often inefficient. They are often applied on an ad hoc basis once an event (for example, a food safety event) has been discovered or hypothesized. Accordingly, conventional methods run the risk of missing surprises, because surprises do not occur frequently (and are therefore unlikely to be considered as an interpretation of observed data) and because, by definition, conventional methods rely on a priori knowledge. For example, keyword searches look for terms identified by a human analyst, machine classifiers are trained on the familiar, and logistic regression looks for risk factors of predefined, desired outcomes. Similarly, monitoring numerical variables that are “born digital” (e.g., meteorological factors from sensors or counts of tests failed at inspection centers) can be limited in terms of sensitivity and specificity and may or may not be appropriate for the gamut of events of interest. Data for food event surveillance, for example, are generally drawn from many sources.
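The numerical-variable monitoring described above (and recited as the second co-analysis branch of the claims: constructing numerical metrics, identifying a baseline, and flagging updated metrics that deviate from it) can be sketched minimally. The mean/standard-deviation baseline and the three-sigma threshold are illustrative assumptions, not the claimed method.

```python
import statistics

def baseline(history):
    """Baseline for a numerical metric: mean and population std. dev."""
    return statistics.mean(history), statistics.pstdev(history)

def deviates(value, base, threshold=3.0):
    """Flag an updated metric that deviates from the baseline by more
    than `threshold` standard deviations."""
    mean, sd = base
    if sd == 0:
        return value != mean
    return abs(value - mean) / sd > threshold

# e.g., weekly counts of failed inspections at a testing center
history = [4, 5, 3, 6, 4, 5]
base = baseline(history)
alerts = [v for v in [5, 21] if deviates(v, base)]  # alerts == [21]
```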
The provenance of those data (who produced the data, how they were measured, and the path the data took between production and acquisition) must be understood so that limitations and bias can be assessed (and estimated if possible). How data are cleaned (i.e., prepared for machine analysis) and how they are processed can introduce further error and bias, which must be understood if results are to be interpreted correctly. Methods centered on data not born digitally (or data of unknown or questionable provenance, or data that are not cleaned according to a consistent methodology) produce results that can be unclear if assumptions regarding the data are made that are not documented, formally explored, or defensible. Recent patent applications have described systems that allow the available corpus of data (usually publicly-available documents) to dictate potential hypotheses or potential events. U.S. Pat. Pub. No. 2015/0235138 and U.S. Pat. Pub. No. 2016/0358087 describe coding documents according to an ontology, populating a multi-dimensional ontology space by adding points i