US-12625851-B2 - System for interconnecting data management for scientific research with a registry for publishing of quality-controlled data
Abstract
The invention is a registry in the cloud that stores metadata and quality control metadata for scientific studies and datasets that it has profiled with an AI-based quality control engine that runs in the data contributor's environment.
Inventors
- John T. Walker
Assignees
- Lifetime Omics, Inc.
Dates
- Publication Date
- 20260512
- Application Date
- 20240929
Claims (19)
- 1 . A computer-implemented system comprising: one or more processors and memory storing instructions; and distributed functional modules executed by the one or more processors and comprising: an indexing and query engine; a study/data profiler module; a scored study/data profiles and metadata portable formatting module (MPFM) module; a profile and quality-control (QC) results metabase; a study/data registry publisher; a study/data QC profiles template store; a QC engine; a user interface for QC; a user interface for study/data registration; a user interface for data discovery and query access; and a registry entries exposed via domain name server (DNS) module; wherein the user interface for QC, the study-data QC profiles template store, and the QC engine execute as a QC subsystem within an external data-management environment in which datasets reside, and the study-data profiler, the indexing and query engine, the scored study-data profiles and MPFM module, the profile and QC results metabase, and the study-data registry publisher execute as a registry subsystem in a registry cloud environment remote from the external data-management environment; wherein the QC subsystem, in the external data-management environment, applies template-driven QC validations selected from the study-data QC profiles template store to datasets of a registered study to produce QC results comprising a QC report and a quantitative QC score, and the QC engine establishes a secure, encrypted session socket to the scored study-data profiles and MPFM module in the registry cloud environment to transmit the QC results; wherein the scored study-data profiles and MPFM module, in cooperation with the indexing and query engine and the profile and QC results metabase, generates an MPFM-formatted, portable metadata object for the study that encodes study and dataset attributes and includes the QC report and the quantitative QC score, and indexes the MPFM-formatted, portable metadata object in the profile and QC results metabase for discovery based on the encoded attributes and QC assessment; and wherein the user interface for study-data registration, the user interface for data discovery and query access, the study-data registry publisher, and the registry-entries-exposed-via-DNS module: support registering studies and associated datasets in the registry cloud environment while maintaining the datasets hosted in the external data-management environment; publish a meta-tagged abstract and a DNS-based trust profile for the study and its datasets, the trust profile reflecting contributor identity, QC assessment, and dataset reusability ratings, such that the meta-tagged abstract and DNS-based trust profile are publicly discoverable via URLs and DNS while the datasets remain externally hosted; and enable, via the user interface for data discovery and query access, searching and discovering registered studies using the indexed MPFM-formatted, portable metadata objects and initiating or responding to access, collaboration, or communications relating to discovered studies or datasets.
- 2 . The system as in claim 1 wherein: the indexing and query engine indexes, in the profile and QC results metabase, MPFM-formatted, portable metadata objects and associated study and dataset metadata to support discovery of studies and datasets based on study attributes, dataset characteristics, contributor identity, and QC assessment.
- 3 . The system as in claim 1 wherein: the study/data profiler module organizes and annotates research data for a registered study, extracting key study-level and dataset-level metrics and metadata and generating study-data profiles that are supplied to the scored study-data profiles and MPFM module for inclusion in the MPFM-formatted, portable metadata objects.
- 4 . The system as in claim 1 wherein the scored study-data profiles and MPFM module: formats and annotates metadata objects for search-engine tagging and registry cataloging; generates the MPFM-formatted, portable metadata objects to encapsulate study and dataset attributes, QC reports, and QC scores; and updates the study-data QC profiles template store based on study and data types processed so that subsequent QC validations are adapted to prior registry experience.
- 5 . The system as in claim 1 wherein the profile and QC results metabase stores and maintains MPFM-formatted, portable QC objects and study metadata as entries that are distinct from underlying datasets and that are indexed by the indexing and query engine to support queries over study attributes, dataset characteristics, QC scores, and trust-related parameters.
- 6 . The system as in claim 1 , wherein the study-data registry publisher creates a meta-tagged abstract of a study and its datasets for publication.
- 7 . The system as in claim 1 wherein: the user interface for QC, the QC Engine, and the study/data profiles QC template store execute as a subsystem within the external data-management environment.
- 8 . The system as in claim 1 , wherein the QC engine: evaluates data quality, validates interoperability, identifies and corrects errors, generates QC reports, and calculates a QC score; and establishes the secure, encrypted session socket to the scored study-data profiles and MPFM module.
- 9 . The system as in claim 1 , wherein the study-data QC profiles template store manages standardized QC templates comprising predefined rules and parameters.
- 10 . The system as in claim 1 , wherein the user interface for QC guides users to run QC within the external data-management environment.
- 11 . The system as in claim 1 , wherein the user interface for study-data registration guides registering a study with datasets.
- 12 . The system as in claim 1 , wherein the user interface for data discovery and query access: guides searching or discovering data; initiates communication, or collaboration requests; and supports replies to received requests.
- 13 . The system as in claim 1 , wherein the registry-entries-exposed-via-DNS module exposes study and data metadata via the Internet.
- 14 . A computer-implemented method executed by one or more processors comprising registering, via a user interface for study-data registration and a study-data registry publisher in a registry cloud environment, studies and associated datasets for the registry cloud environment while maintaining the datasets in an external data-management environment; setting up, within the external data-management environment, a QC subsystem comprising a QC engine, a user interface for QC, and a study-data-QC profiles template store to select template-driven QC validations based on study or dataset characteristics; running, by the QC engine in the external data-management environment, the QC validations on at least one dataset of a registered study to produce QC results comprising a QC report and a quantitative QC score; establishing, by the QC engine, a secure, encrypted session socket to a scored study-data profiles and metadata portable formatting (MPFM) module in the registry cloud environment and transmitting the QC results to the MPFM module; generating, by the MPFM module in cooperation with an indexing and query engine and a profile and QC results metabase, an MPFM-formatted, portable metadata object for the registered study that encodes study and dataset attributes and includes the QC report and the quantitative QC score, and indexing the MPFM-formatted, portable metadata object in the profile and QC results metabase for discovery; publishing, by the study-data registry publisher and a registry-entries-exposed-via-domain-name-system (DNS) module, a meta-tagged abstract and a DNS-based trust profile for the registered study and its datasets, the DNS-based trust profile reflecting contributor identity, QC assessment, and dataset reusability ratings, the meta-tagged abstract and DNS-based trust profile being publicly discoverable via URLs and DNS while the datasets remain hosted in the external data-management environment; and facilitating, via a user interface for data discovery and query access, discovery of the registered study based on the indexed MPFM-formatted, portable metadata object and initiating or responding to communications relating to the discovered study or dataset.
- 15 . The method as in claim 14 wherein selecting the template-driven QC validations based on study or dataset characteristics comprises updating one or more templates in the study-data QC profiles template store in response to prior QC results for related studies or datasets, and subsequently selecting QC validations according to the updated templates.
- 16 . The method as in claim 14 wherein generating the MPFM-formatted, portable metadata object comprises encapsulating, in the object, study and dataset attributes, the QC report, and the quantitative QC score without including the underlying datasets, such that the MPFM-formatted, portable metadata object is distinct from the underlying datasets.
- 17 . The method as in claim 14 wherein indexing the MPFM-formatted, portable metadata object in the profile and QC results metabase comprises storing the MPFM-formatted, portable metadata object as a registry entry that is separate from any data store containing the underlying datasets and indexing the registry entry for queries over study attributes, dataset characteristics, QC scores, and trust-related parameters.
- 18 . The method as in claim 14 wherein the DNS-based trust profile published for the registered study references the MPFM-formatted, portable metadata object in the profile and QC results metabase while the datasets remain hosted in the external data-management environment.
- 19 . The method as in claim 14 wherein: publishing the metadata on the internet for public discovery comprises: making a meta-tagged abstract of a study and its data via its metadata available via URLs and publicly accessible via the solution platform's DNS zones; providing a user interface for study/data discovery & query access; and providing a user interface for communications related to the discovered study/data.
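Claims 1, 5, 16, and 17 describe a portable metadata object that carries study and dataset attributes, the QC report, and the QC score, is stored separately from the underlying datasets, and is indexed for queries. The claims do not specify a concrete MPFM layout, so the following is only a minimal sketch under stated assumptions: the field names, the `dataset_location` pointer, the `Metabase` class, and the `example.org` URLs are all hypothetical and chosen for illustration.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class MetadataObject:
    """Hypothetical shape for an MPFM-formatted, portable metadata object."""
    study_id: str
    title: str
    contributor: str
    dataset_attributes: dict
    qc_report: list
    qc_score: float
    dataset_location: str  # assumption: a pointer only; the data stays externally hosted

    def to_json(self) -> str:
        # Portable form: plain JSON holding metadata and QC results, never dataset rows.
        return json.dumps(asdict(self), sort_keys=True)


class Metabase:
    """Hypothetical in-memory stand-in for the profile and QC results metabase."""

    def __init__(self) -> None:
        self._entries: dict[str, MetadataObject] = {}

    def index(self, obj: MetadataObject) -> None:
        # Registry entries are keyed by study and kept apart from any data store.
        self._entries[obj.study_id] = obj

    def search(self, *, min_qc_score: float = 0.0, **attributes) -> list[MetadataObject]:
        # Match on exact dataset attributes and a QC score floor, best score first.
        hits = [
            obj for obj in self._entries.values()
            if obj.qc_score >= min_qc_score
            and all(obj.dataset_attributes.get(k) == v for k, v in attributes.items())
        ]
        return sorted(hits, key=lambda obj: obj.qc_score, reverse=True)


metabase = Metabase()
metabase.index(MetadataObject(
    study_id="study-001", title="Example expression study", contributor="Lab A",
    dataset_attributes={"modality": "RNA-seq", "format": "csv"},
    qc_report=[{"rule": "subject_id present", "failed": 0}],
    qc_score=92.5, dataset_location="https://example.org/lab-a/study-001",
))
metabase.index(MetadataObject(
    study_id="study-002", title="Example pilot study", contributor="Lab B",
    dataset_attributes={"modality": "RNA-seq", "format": "csv"},
    qc_report=[{"rule": "subject_id present", "failed": 12}],
    qc_score=61.0, dataset_location="https://example.org/lab-b/study-002",
))

hits = metabase.search(min_qc_score=70, modality="RNA-seq")
hit_ids = [obj.study_id for obj in hits]
portable = json.loads(hits[0].to_json())
```

Here `hit_ids` contains only the study whose QC score clears the requested floor, and `portable` is the JSON form that could be exchanged between systems. Keeping only a location pointer in the object mirrors the claimed separation between registry entries and the externally hosted datasets.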
Description
TECHNICAL FIELD
The invention is an online cloud-based registry that interoperates with externally hosted scientific data and users to facilitate cooperative data quality control and sharing.
BACKGROUND OF INVENTION
Scientific research often encounters difficulties in data sharing due to incompatible formats, security concerns, and lack of centralized platforms. This hinders collaboration and slows down scientific discoveries, especially in the era of artificial intelligence which requires access to large amounts of data. There is a need to centralize the metadata of quality data so that users can contribute information about their quality research to that centralized registry where these and other users can find and seek out use of quality data located in myriad locations.
BRIEF DESCRIPTION OF INVENTION
The invention is meant to facilitate the centralized registry of quality-controlled scientific data regardless of where it is stored. The invention system is a distributed system comprising two subsystems, a registry environment in its own cloud, and a quality control application installed in a scientific user's data-management environment. Users interact with these systems via browser/internet user interfaces. It is operative to seamlessly interoperate with external scientific data-management systems and repositories in order to begin centralizing a registry of disparately located scientific data so that research users can centrally register data they own and, these research users, or others, can centrally query metadata needed to efficiently locate data and complete current or future research projects. The system is operative to enable data contributors to register and provide information about research data they wish to contribute, the study that generated the data, then quality control the data to be contributed before putting its metadata in a centralized registry.
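The quality-control step that runs in the contributor's own environment can be pictured as applying a template of predefined rules to a dataset and reducing the outcome to a report and a single score. The sketch below is illustrative only: the `QCRule` and `QCTemplate` structures, the example rules, and the pass-rate scoring formula are assumptions, since the text does not define how templates are encoded or how the score is computed.

```python
from dataclasses import dataclass
from typing import Callable

Row = dict[str, object]


@dataclass(frozen=True)
class QCRule:
    """One predefined rule in a QC template (hypothetical structure)."""
    name: str
    check: Callable[[Row], bool]  # returns True when a row passes the rule


@dataclass(frozen=True)
class QCTemplate:
    """A set of rules selected for a given kind of study (hypothetical)."""
    study_type: str
    rules: tuple[QCRule, ...]


def run_qc(template: QCTemplate, rows: list[Row]) -> dict:
    """Apply every rule to every row and return a QC report plus a 0-100 score."""
    report = []
    passed_total = 0
    for rule in template.rules:
        passed = sum(1 for row in rows if rule.check(row))
        passed_total += passed
        report.append({"rule": rule.name, "passed": passed, "failed": len(rows) - passed})
    checks = len(template.rules) * len(rows)
    # Assumed scoring: percentage of all rule checks that passed.
    score = round(100 * passed_total / checks, 1) if checks else 0.0
    return {"study_type": template.study_type, "report": report, "score": score}


def age_in_range(row: Row) -> bool:
    age = row.get("age")
    return isinstance(age, (int, float)) and 0 <= age <= 120


template = QCTemplate(
    study_type="cohort",
    rules=(
        QCRule("subject_id present", lambda row: bool(row.get("subject_id"))),
        QCRule("age within 0-120", age_in_range),
    ),
)

rows = [
    {"subject_id": "S1", "age": 34},
    {"subject_id": "", "age": 51},     # fails the subject_id rule
    {"subject_id": "S3", "age": 140},  # fails the age rule
    {"subject_id": "S4", "age": 8},
]

qc_results = run_qc(template, rows)
print(qc_results["score"])  # 75.0 (6 of 8 checks pass)
```

Only `qc_results`, not `rows`, would need to leave the contributor's environment, which is the point of running the check locally before any metadata reaches the registry.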
A researcher who wants to find needed data would then have a registry providing metadata with study, data and data quality-control information related to that data. A researcher upon reusing and improving the quality of the data may register the reused and curated data in the registry becoming a data contributor. A researcher may also collaborate with data contributors in collaboratively curating the data or in a new study that reuses the data. There is no effective way to actually centralize scientific data because it resides in myriad, disconnected stores. But, there is a way to centralize information about that data that helps researchers find what they need, regardless of where it resides, and seek access permissions, offer financial support, and/or offer collaborative support.
BRIEF DESCRIPTION OF DRAWINGS
FIG. 1 shows an embodiment of the system.
FIG. 2 shows an exemplary quality-control record for data to be contributed.
FIG. 3 shows an exemplary quality-control score for a data contribution.
FIG. 4 shows an embodiment of how a data contributor interfaces with the registry and related systems.
FIG. 5 shows an embodiment of how a data consumer interfaces with the system.
FIG. 6 shows an exemplary metadata model and the elements it comprises.
FIG. 7 shows exemplary method steps for data contribution.
DETAILED DESCRIPTION OF INVENTION
Scientific progress and breakthroughs in the era of artificial intelligence (AI) rely on quality data, data sharing, and collaboration but the process for doing so is confounded by a lack of centralized dataset information and the unknown quality or value of the data. The reality is that scientific datasets may be stored on myriad systems in an inherently decentralized environment. Even information about those datasets is inherently decentralized. As a result, researchers in need of critical dataset information may have to rely more on serendipity rather than an efficient means of discovery and permissions.
The invention herein disclosed is aimed at offering a means of secure collection of dataset metadata, making sure the data is quality-controlled, and preserving the manifestations of data governance to which some datasets must adhere. Attempting to centralize the storage and access of datasets would be a long-term and herculean endeavor. But, providing a means for centralizing information about these disparate datasets can be done. It requires a system and method for contributing dataset information, or metadata, to a centralized registry, a means of evaluating the quality of the data, a means of discovering such metadata along with streamlined, secure means of offering funding, and/or seeking collaboration and communication with the data contributors. To that end, the system disclosed acts like an intermediary system that offers a way to contribute information about datasets, to seek information about datasets, and to do so in a predictable, orderly, quality-controlled way. Prospective data contributors and dataset seekers must meet qualifications in order to make use of the system. Data governance aspe