US-12626148-B2 - Methods and systems for discovering and classifying application assets and their relationships
Abstract
Various methods, apparatuses/systems, and media for implementing a data discovery module are disclosed. A repository includes one or more memories that store application code for each application among a plurality of applications. A processor is operatively connected to the repository via a communication network. The processor scans the application source code for each application among the plurality of applications; identifies, in response to scanning, all technical assets and their relationships within each application; harvests technical metadata from the technical assets and their relationships to identify what information is used, stored, created, and moved by the application; implements machine learning algorithms to automatically assign descriptive and administrative metadata at a field level; loads the assigned descriptive and administrative metadata into an enterprise data catalog; and creates, in response to loading, a knowledge map, thereby providing a fine-grain level understanding of data within the technical assets.
Inventors
- Johnathan Seungtae RA
- Sergei Z Maluszycki
- Mark A Jackson
- John Wright
Assignees
- JPMORGAN CHASE BANK, N.A.
Dates
- Publication Date
- 20260512
- Application Date
- 20211214
Claims (12)
- 1 . A method for discovering and classifying application assets and their relationships corresponding to an application by utilizing one or more processors and one or more memories, the method comprising: implementing, by a language-agnostic data discovery device (DDD), a language-agnostic data discovery module (DDM), wherein the one or more processors are embedded within the DDD, wherein the DDM is configured to discover and classify assets of the application and their relationships for the purpose of automated data lineage, thereby providing a fine-grain understanding of data within the technical assets, wherein the DDM includes a plurality of modular application programming interfaces (APIs) corresponding respectively to a scanning module, an identifying module, a harvesting module, an implementing module, a loading module, a creating module, a publishing module, a subscribing module, and an updating module, wherein each module is called via a corresponding API; scanning, by calling the scanning module via a first API, application source code stored in a repository embedded within the DDD, the repository including one or more memories configured to store application source code, API specification definition files, data files, and machine learning models for a plurality of applications; identifying, in response to scanning, all technical assets and their relationships within each application by calling the identifying module via a second API, wherein the fine-grain understanding of data refers to identifying all data stores held by each application, and identifying all tables and columns within the data stores corresponding to each application; harvesting, in response to identifying all tables and columns within the data stores, by calling the harvesting module via a third API, metadata from the identified technical assets and relationships to determine how data is stored, used, created, or moved within the application and between adjoining applications; 
implementing machine learning algorithms stored in the repository to automatically assign descriptive and administrative metadata at a field level by calling the implementing module via a fourth API; loading the assigned descriptive and administrative metadata into an enterprise data catalog by calling the loading module via a fifth API; creating, in response to loading, by calling the creating module via a sixth API, a fine-grain data lineage map representing movement of data within and across applications; automatically capturing, by the DDM, the technical assets and their relationships at the fine-grain level based on the lineage map; labeling, by the DDM, datasets from the technical assets with predefined taxonomies stored in the repository; publishing an event with a payload corresponding to the predefined taxonomies by calling the publishing module via a seventh API; subscribing to the published event by calling the subscribing module via an eighth API to monitor data movement; and automatically updating the lineage map in response to subscribing to the published event by calling the updating module via a ninth API, wherein the DDD and DDM are operable within a cloud-based computing environment configured to be storage-platform agnostic.
- 2 . The method according to claim 1 , wherein the technical assets include data stores, the APIs, and services within each application.
- 3 . The method according to claim 1 , wherein the map includes telemetry and reporting from the enterprise data catalog for automatic triggers, data risk score cards, and automated policy enforcement.
- 4 . The method according to claim 1 , further comprising: automatically updating inventory of applications in response to subscribing to the published event.
- 5 . A system for discovering and classifying application assets and their relationships corresponding to an application, the system comprising: a repository including one or more memories that store application code for each application among a plurality of applications; and a processor operatively connected to the repository via a communication network, wherein the processor is configured to: implement, by a language-agnostic data discovery device (DDD), a language-agnostic data discovery module (DDM), wherein the one or more processors are embedded within the DDD, wherein the DDM is configured to discover and classify assets of the application and their relationships for the purpose of automated data lineage, thereby providing a fine-grain understanding of data within the technical assets, wherein the DDM includes a plurality of modular application programming interfaces (APIs) corresponding respectively to a scanning module, an identifying module, a harvesting module, an implementing module, a loading module, a creating module, a publishing module, a subscribing module, and an updating module, wherein each module is called via a corresponding API; scan, by calling the scanning module via a first API, application source code stored in a repository embedded within the DDD, the repository including one or more memories configured to store application source code, API specification definition files, data files, and machine learning models for a plurality of applications; identify, in response to scanning, all technical assets and their relationships within each application by calling the identifying module via a second API, wherein the fine-grain understanding of data refers to identifying all data stores held by each application, and identifying all tables and columns within the data stores corresponding to each application; harvest, in response to identifying all tables and columns within the data stores, by calling the harvesting module via a third API, 
metadata from the identified technical assets and relationships to determine how data is stored, used, created, or moved within the application and between adjoining applications; implement machine learning algorithms stored in the repository to automatically assign descriptive and administrative metadata at a field level by calling the implementing module via a fourth API; load the assigned descriptive and administrative metadata into an enterprise data catalog by calling the loading module via a fifth API; create, in response to loading, by calling the creating module via a sixth API, a fine-grain data lineage map representing movement of data within and across applications; automatically capture, by the DDM, the technical assets and their relationships at the fine-grain level based on the lineage map; label, by the DDM, datasets from the technical assets with predefined taxonomies stored in the repository; publish an event with a payload corresponding to the predefined taxonomies by calling the publishing module via a seventh API; subscribe to the published event by calling the subscribing module via an eighth API to monitor data movement; and automatically update the lineage map in response to subscribing to the published event by calling the updating module via a ninth API, wherein the DDD and DDM are operable within a cloud-based computing environment configured to be storage-platform agnostic.
- 6 . The system according to claim 5 , wherein the technical assets include data stores, the APIs, and services within each application.
- 7 . The system according to claim 5 , wherein the map includes telemetry and reporting from the enterprise data catalog for automatic triggers, data risk score cards, and automated policy enforcement.
- 8 . The system according to claim 5 , wherein the processor is further configured to: automatically update inventory of applications in response to subscribing to the published event.
- 9 . A non-transitory computer readable medium configured to store instructions for discovering and classifying application assets and their relationships corresponding to an application, wherein when executed, the instructions cause a processor to perform the following: implementing, by a language-agnostic data discovery device (DDD), a language-agnostic data discovery module (DDM), wherein the one or more processors are embedded within the DDD, wherein the DDM is configured to discover and classify assets of the application and their relationships for the purpose of automated data lineage, thereby providing a fine-grain understanding of data within the technical assets, wherein the DDM includes a plurality of modular application programming interfaces (APIs) corresponding respectively to a scanning module, an identifying module, a harvesting module, an implementing module, a loading module, a creating module, a publishing module, a subscribing module, and an updating module, wherein each module is called via a corresponding API; scanning, by calling the scanning module via a first API, application source code stored in a repository embedded within the DDD, the repository including one or more memories configured to store application source code, API specification definition files, data files, and machine learning models for a plurality of applications; identifying, in response to scanning, all technical assets and their relationships within each application by calling the identifying module via a second API, wherein the fine-grain understanding of data refers to identifying all data stores held by each application, and identifying all tables and columns within the data stores corresponding to each application; harvesting, in response to identifying all tables and columns within the data stores, by calling the harvesting module via a third API, metadata from the identified technical assets and relationships to determine how data is stored, used, created, or 
moved within the application and between adjoining applications; implementing machine learning algorithms stored in the repository to automatically assign descriptive and administrative metadata at a field level by calling the implementing module via a fourth API; loading the assigned descriptive and administrative metadata into an enterprise data catalog by calling the loading module via a fifth API; creating, in response to loading, by calling the creating module via a sixth API, a fine-grain data lineage map representing movement of data within and across applications; automatically capturing, by the DDM, the technical assets and their relationships at the fine-grain level based on the lineage map; labeling, by the DDM, datasets from the technical assets with predefined taxonomies stored in the repository; publishing an event with a payload corresponding to the predefined taxonomies by calling the publishing module via a seventh API; subscribing to the published event by calling the subscribing module via an eighth API to monitor data movement; and automatically updating the lineage map in response to subscribing to the published event by calling the updating module via a ninth API, wherein the DDD and DDM are operable within a cloud-based computing environment configured to be storage-platform agnostic.
- 10 . The non-transitory computer readable medium according to claim 9 , wherein when executed, the instructions cause the processor to perform the following: automatically updating inventory of applications in response to subscribing to the published event.
- 11 . The non-transitory computer readable medium according to claim 9 , wherein the technical assets include data stores, the APIs, and services within each application.
- 12 . The non-transitory computer readable medium according to claim 9 , wherein the map includes telemetry and reporting from the enterprise data catalog for automatic triggers, data risk score cards, and automated policy enforcement.
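The nine-module sequence recited in claims 1, 5, and 9 can be sketched in code as follows. This is a hypothetical illustration only: the class name, method signatures, and data shapes are assumptions for readability and are not claim limitations; the placeholder classifier stands in for the machine learning models the claims locate in the repository.

```python
# Hypothetical sketch of the nine claimed modules, each exposed as one API.
class DataDiscoveryModule:
    def __init__(self, repository):
        self.repository = repository   # source code, spec files, ML models
        self.catalog = {}              # stands in for the enterprise data catalog
        self.lineage_map = {}
        self.subscribers = []

    def scan(self, app):
        """First API: scan the application source code in the repository."""
        return self.repository[app]["source"]

    def identify(self, source):
        """Second API: identify data stores, tables, and columns."""
        return source["data_stores"]

    def harvest(self, assets):
        """Third API: harvest technical metadata (how data is stored/used/moved)."""
        return dict(assets)

    def classify(self, metadata):
        """Fourth API: assign descriptive metadata at the field level.
        A real embodiment applies stored ML models; this placeholder tags
        every column identically."""
        return {store: {col: "unclassified" for col in cols}
                for store, cols in metadata.items()}

    def load(self, app, labeled):
        """Fifth API: load assigned metadata into the data catalog."""
        self.catalog[app] = labeled

    def create_map(self, app):
        """Sixth API: create a fine-grain lineage map from the catalog."""
        self.lineage_map[app] = sorted(self.catalog[app])
        return self.lineage_map

    def publish(self, event):
        """Seventh API: publish an event whose payload names the taxonomy."""
        for handler in self.subscribers:
            handler(event)

    def subscribe(self, handler):
        """Eighth API: subscribe to published events to monitor data movement."""
        self.subscribers.append(handler)

    def update_map(self, app, store):
        """Ninth API: update the lineage map when a subscribed event arrives."""
        self.lineage_map.setdefault(app, []).append(store)


# End-to-end walk-through for a single hypothetical application.
repo = {"app1": {"source": {"data_stores": {"orders_db": ["order_id", "ssn"]}}}}
ddm = DataDiscoveryModule(repo)
assets = ddm.identify(ddm.scan("app1"))
ddm.load("app1", ddm.classify(ddm.harvest(assets)))
lineage = ddm.create_map("app1")

events = []
ddm.subscribe(lambda e: ddm.update_map(e["app"], e["store"]))
ddm.subscribe(events.append)
ddm.publish({"app": "app1", "store": "customers_db"})
print(lineage)   # the lineage map, automatically updated via the subscription
```

The publish/subscribe pair at the end mirrors the claimed seventh through ninth APIs: the published taxonomy event is consumed by a subscriber that refreshes the lineage map without any manual re-scan.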
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims the benefit of priority from U.S. Provisional Patent Application No. 63/134,314, filed Jan. 6, 2021, which is herein incorporated by reference in its entirety.
TECHNICAL FIELD
This disclosure generally relates to data governance, and, more particularly, to methods and apparatuses for discovering and classifying application assets and their relationships for the purpose of data lineage.
BACKGROUND
The developments described in this section are known to the inventors. However, unless otherwise indicated, it should not be assumed that any of the developments described in this section qualify as prior art merely by virtue of their inclusion in this section, or that those developments are known to a person of ordinary skill in the art.
Data governance may be of importance for a large organization, such as JPMorgan Chase (JPMC). In JPMC, there may be a large number of applications deployed to production, and they may intercommunicate in order to deliver business value. Describing data to keep an accurate inventory for these applications is fundamental to data governance. Data governance may seek to provide capabilities to answer some basic questions, for example: what data may be needed and where (data requirements); what data may be currently available (data in place, i.e., where data is held/stored); where does the data come from and go to (data in motion, i.e., the lineage of how data moves between one place and another); where should the data come from (data authority, i.e., designation of data locations and systems of record (SOR) and authoritative data sources (ADS)); what data should be shared the most (reference data); etc. Conventional tools may only provide manual tracking, which is error prone; it may prove extremely difficult and time consuming to manually keep track of this data, thereby failing to ensure accurate inventory and lineage information. 
Conventional tools also lack capabilities to get to finer-grained assets and to the code as required for adequate data control. Further, in this application programming interface (API) first world, data in motion may need to be captured and related at the right grain of execution (APIs/events). Conventional tools lack capabilities for capturing and relating the data in motion at the right grain of execution. Thus, there is a need for automation of this inventory in order to provide a more precise (“fine-grain”) understanding of data within the technical assets to solve today's data governance problem.
SUMMARY
The present disclosure, through one or more of its various aspects, embodiments, and/or specific features or sub-components, provides, among others, various systems, servers, devices, methods, media, programs, and platforms for implementing a language-agnostic data discovery module for discovering and classifying application assets and their relationships for the purpose of data lineage, thereby providing a more precise (“fine-grain”) understanding of data within the technical assets, but the disclosure is not limited thereto. The present disclosure, through one or more of its various aspects, embodiments, and/or specific features or sub-components, also provides, among others, various systems, servers, devices, methods, media, programs, and platforms for implementing a data discovery module for automated data lineage which can discover extensive detail about the components of the application, but the disclosure is not limited thereto. 
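The field-level assignment of descriptive metadata described above can be illustrated with a brief, hypothetical sketch. The rule set, labels, and function name below are illustrative assumptions only; an actual embodiment would apply the trained machine learning models stored in the repository rather than simple name heuristics.

```python
import re

# Hypothetical stand-in for the field-level classifiers: pattern rules on
# column names emulate the shape of the descriptive metadata that stored
# ML models would assign per column.
RULES = [
    (re.compile(r"ssn|social", re.I), "restricted"),
    (re.compile(r"email|phone", re.I), "personal"),
]

def assign_descriptive_metadata(columns):
    """Return a {column: label} map at field-level granularity."""
    labels = {}
    for col in columns:
        # First matching rule wins; unmatched columns get a default label.
        labels[col] = next((label for pattern, label in RULES
                            if pattern.search(col)), "internal")
    return labels

print(assign_descriptive_metadata(["ssn", "email_addr", "order_total"]))
# → {'ssn': 'restricted', 'email_addr': 'personal', 'order_total': 'internal'}
```

Once each column carries such a label, loading the result into the enterprise data catalog gives downstream consumers the fine-grain, per-field view of the data that the disclosure describes.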
This automated data lineage solution, according to exemplary embodiments, allows a more detailed view of how data is stored within an application (i.e., identifies all data stores held by the application, identifies all tables and columns within those data stores, etc.), how data moves within an application (i.e., which services and events utilize data from the data store, which APIs distribute data from the data store, etc.), and how data moves between applications when adjoining applications are scanned (i.e., which APIs distribute data to other applications, which batch files are sent to other applications, etc.), but the disclosure is not limited thereto. The present disclosure, through one or more of its various aspects, embodiments, and/or specific features or sub-components, also provides, among others, various systems, servers, devices, methods, media, programs, and platforms for implementing a data discovery module that automatically captures assets and their relationships at the right grain, identifies technical assets and relationships (through code scanners), harvests physical data structures, labels datasets with conceptual taxonomies, provides a sustainable evergreen solution, provides accurate inventory and lineage information, provides appropriate controls at the right unit of management, provides impact analysis, etc., but the disclosure is not limited thereto. According to an aspect of the present disclosure, a method for discovering and classifying application assets and