US-12625900-B2 - Unified data classification techniques
Abstract
A method, computer system, and computer program product for data processing, comprising obtaining a plurality of files from a data source. The files are analyzed to determine information about the content and the structural information of each file. Once the files have been analyzed, the information in each file may be sorted and categorized by common content. Sensitive information may also be extracted and categorized separately. The information may then be merged using the categories to create a single unified file.
Inventors
- Youngja Park
- Walid Rjaibi
- Ariel Farkash
- Mohammed Fahd Alhamid
- Stefano Braghin
- Jing Xin Duan
- Mokhtar Kandil
- Michael Vu Le
- Killian Levacher
- Micha Gideon Moffie
- Ian Michael Molloy
Assignees
- INTERNATIONAL BUSINESS MACHINES CORPORATION
Dates
- Publication Date
- 20260512
- Application Date
- 20220627
Claims (20)
- 1 . A method for data processing, comprising: obtaining a plurality of files from a data source to process a user request for data storage or retrieval, wherein said plurality of files include documents that are structured and documents that are unstructured; extracting logical locations of text units and other information including security applications from the plurality of files; using a controller to determine from said obtained plurality of files information relating to storage and isolation of files to be stored or retrieved in connection with the user request for data storage or retrieval; using a topic classification module executed by a processor and operative to generate and store a single classification document that, for each of the plurality of files and for text units within the plurality of files, specifies corresponding topic categories and sensitivity levels based on one or more user selections and based on user documents that include said extracted logical locations and said security applications as well as said information determined by the controller; analyzing content of said files and said single classification document to determine any common data information and structural information of each file, wherein analyzing content also includes extracting a plurality of sensitive data; parsing said obtained files to extract further information so as to create a data model; applying natural language processing (NLP) to said data model to determine a structure of the obtained files and providing an output file with results of said determination, wherein determining the structure of the obtained files during the NLP step includes tokenization; creating a common data representation by categorizing data from the analyzed files together into one common representation, wherein at least one category contains any sensitive data extracted; using the controller and the single classification document to determine, for each of the plurality of files, storage 
and isolation parameters for storing or retrieving the files in response to the user request; merging and compiling information that are similar into a unified file using said categories specified in the single classification document and completing said user request for storage or retrieval by providing the unified file with merged information, wherein said unified file has a single format and structure and identifies any sensitive data accordingly; and using a report generator module to provide a single unified final report based on the unified file and the single classification document, wherein said single unified final report includes all merged and compiled information and provides a classification for different types of the merged and compiled information.
- 2 . The method of claim 1 , wherein analyzing step further comprises: extracting sensitive information from said output file using an Entity Extraction module; and using a Topic Classification module to detect sensitive business data categories.
- 3 . The method of claim 1 , wherein said common data representation provides a plurality of information formats from said files in a unified manner.
- 4 . The method of claim 3 , wherein one or more data crawlers obtain data from data sources to identify unstructured documents.
- 5 . The method of claim 4 , wherein said data sources can be one or more repositories or databases.
- 6 . The method of claim 2 , wherein the determination of structure of obtained files includes sentence boundary detection, part of speech tagging and syntactic dependencies.
- 7 . The method of claim 2 , wherein the Entity Extraction module processes one or more data models and rules for extracting entities.
- 8 . The method of claim 2 , wherein a Topic Classification module categorizes content into user defined categories.
- 9 . The method of claim 8 , wherein said Topic Classification module provides at least one category designated as including sensitive data.
- 10 . The method of claim 9 , wherein sensitive information includes at least one of personal identifier, social security, birth date, confidential, top secret, and other sensitive data types.
- 11 . A computer system for data processing, comprising: one or more processors, one or more computer-readable memories, one or more computer-readable tangible storage media, and program instructions stored on at least one of the one or more tangible storage media for execution by at least one of the one or more processors via at least one of the one or more memories, wherein the computer system is capable of performing a method comprising: obtaining a plurality of files from a data source to process a user request for data storage or retrieval, wherein said plurality of files include documents that are structured and documents that are unstructured; extracting logical locations of text units and other information including security applications from the plurality of files; using a controller to determine from said obtained plurality of files information relating to storage and isolation of files to be stored or retrieved in connection with the user request for data storage or retrieval; using a topic classification module to generate and store a single classification document that, for each of the plurality of files and for text units within the plurality of files, specifies corresponding topic categories and sensitivity levels based on one or more user selections, said extracted logical locations, said security applications, and said information determined by the controller; analyzing content of said files and said single classification document to determine any common data information and structural information of each file, wherein analyzing content also includes extracting a plurality of sensitive data; parsing said obtained files to extract further information so as to create a data model; applying natural language processing (NLP) to said data model to determine a structure of the obtained files and providing an output file with results of said determination, wherein determining the structure of the obtained files during the NLP step includes tokenization; 
creating a common data representation by categorizing data from the analyzed files together into one common representation, wherein at least one category contains any sensitive data extracted; using the controller and the single classification document to determine, for each of the plurality of files, storage and isolation parameters for storing or retrieving the files in response to the user request; merging and compiling information that are similar into a unified file using said categories specified in the single classification document and completing said user request for storage or retrieval by providing the unified file with merged information, wherein said unified file has a single format and structure and identifies any sensitive data accordingly; and using a report generator module to provide a single unified final report based on the unified file and the single classification document, wherein said single unified final report includes all merged and compiled information and provides a classification for different types of the merged and compiled information.
- 12 . The computer system of claim 11 , wherein analysis is performed by an orchestration layer in communication with a controller.
- 13 . The computer system of claim 11 , wherein said controller is a user interface that communicates with a user.
- 14 . The computer system of claim 12 , wherein said orchestration layer further includes a parser module, a natural language processor (NLP) module, an Entity Extractor module, a Topic Classification module, and an output generator module.
- 15 . The computer system of claim 12 , wherein said parsing module parses said obtained files so as to provide a data model to said NLP module; said NLP module further determines the structure of the files; and said Entity Extraction module extracts sensitive data.
- 16 . The computer system of claim 12 , wherein said Topic Classification module merges all analyzed information and provides a classification to the output module for generating said unified final report.
- 17 . A computer program product for data processing, comprising: one or more computer-readable storage media and program instructions stored on the one or more tangible storage media, the program instructions executable by a processor, the program instructions comprising instructions for: obtaining a plurality of files from a data source to process a user request for data storage or retrieval, wherein said plurality of files include documents that are structured and documents that are unstructured; extracting logical locations of text units and other information including security applications from the plurality of files; using a controller to determine from said obtained plurality of files information relating to storage and isolation of files to be stored or retrieved in connection with the user request for data storage or retrieval; using a topic classification module to generate and store a single classification document that, for each of the plurality of files and for text units within the plurality of files, specifies corresponding topic categories and sensitivity levels based on one or more user selections, said extracted logical locations, said security applications, and said information determined by the controller; analyzing content of said files and said single classification document to determine any common data information and structural information of each file, wherein analyzing content also includes extracting a plurality of sensitive data; parsing said obtained files to extract further information so as to create a data model; applying natural language processing (NLP) to said data model to determine a structure of the obtained files and providing an output file with results of said determination, wherein determining the structure of the obtained files during the NLP step includes tokenization; creating a common data representation by categorizing data from the analyzed files together into one common representation, wherein at least one category 
contains any sensitive data extracted; using the controller and the single classification document to determine, for each of the plurality of files, storage and isolation parameters for storing or retrieving the files in response to the user request; merging and compiling information that are similar into a unified file using said categories specified in the single classification document and completing said user request for storage or retrieval by providing the unified file with merged information, wherein said unified file has a single format and structure and identifies any sensitive data accordingly; and using a report generator module to provide a single unified final report based on the unified file and the single classification document, wherein said single unified final report includes all merged and compiled information and provides a classification for different types of the merged and compiled information.
- 18 . The computer program product of claim 17 , wherein analysis is performed by an orchestration layer in communication with a controller.
- 19 . The computer program product of claim 18 , wherein said orchestration layer further includes a parser, a natural language processor, an entity extractor module, a Topic Classification module, and an output generator module.
- 20 . The computer program product of claim 19 , wherein said Topic Classification module merges all analyzed information and provides a classification to an output module for generating said unified final report.
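The tokenization step and the Entity Extraction module recited in the claims can be illustrated with a minimal sketch. The claims name the sensitive data types (for example social security, birth date, confidential, top secret) but do not say how they are detected, so the regular expressions, the `Entity` record, and the function names below are assumptions made purely for illustration and are not the claimed implementation.

```python
import re
from dataclasses import dataclass

# Hypothetical detection rules. The claims list the data types only;
# these patterns are illustrative assumptions, not the patented rules.
SENSITIVE_PATTERNS = {
    "social_security": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "birth_date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "confidentiality_marking": re.compile(
        r"\b(?:confidential|top secret)\b", re.IGNORECASE
    ),
}


@dataclass(frozen=True)
class Entity:
    kind: str   # which rule matched
    text: str   # the matched span
    start: int  # character offset where the match begins
    end: int    # character offset just past the match


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def extract_entities(text):
    """Return every rule match in the text, ordered by position."""
    found = []
    for kind, pattern in SENSITIVE_PATTERNS.items():
        for match in pattern.finditer(text):
            found.append(Entity(kind, match.group(), match.start(), match.end()))
    return sorted(found, key=lambda entity: entity.start)
```

Under these assumptions, `extract_entities` returns the matched span together with its character offsets, which is one simple way to record where a sensitive text unit sits inside a file.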
Description
BACKGROUND
The present invention relates generally to the field of data processing and more particularly to techniques for merging data from multiple sources into a unified file and document.
With the advent of technology, many business transactions may be conducted online without the requirement of in-person meetings. Information may be received entirely online: medical results are returned solely via digital portals, contracts are signed remotely using computers, and official documents are reproduced with digital signatures. Most of these transactions require the exchange of confidential and sensitive data. Consequently, data security and confidentiality are important and present an ongoing challenge. Besides confidentiality issues, data may be provided from many sources, and duplication of the same data should be limited.
Therefore, data classification is a crucial step, especially for any data security assessment. Data can be very diverse in nature and dependent on the type of business. A large portion of this data may include customer information and even medical information that must be legally protected. In addition, data files come in a wide range of formats and may not be presented in a uniform format.
SUMMARY
Embodiments of the present invention disclose a method, computer system, and a computer program product for data processing. In one embodiment, a plurality of files may be obtained from a data source and analyzed to determine information about the content and to determine structural information of each file. A common data representation may be created. In one embodiment, the common data representation may provide the data content and structural information from a plurality of different file formats in a uniform way. Sensitive information may also be extracted. In one embodiment, at least one category contains the extracted sensitive data. The analyzed information may then be merged into a unified file using the categories.
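The flow summarized above, in which analyzed content is placed into a common data representation, grouped by category, and merged into one unified result, can be sketched as follows. This is a minimal illustration under stated assumptions: the `TextUnit` record, its field names, and the exact-text deduplication rule are hypothetical and are not taken from the patent.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TextUnit:
    """Hypothetical common representation of one piece of analyzed content."""
    source: str      # file the unit came from
    location: str    # logical location, e.g. "page 2" or "row 14"
    text: str        # the content itself
    category: str    # topic category assigned during analysis
    sensitive: bool = False


def merge_units(units):
    """Group units by category, dropping repeats of the same text.

    Two units count as duplicates when they share a category and their
    text matches after trimming whitespace and lowercasing (an assumed rule).
    """
    unified = defaultdict(list)
    seen = set()
    for unit in units:
        key = (unit.category, unit.text.strip().lower())
        if key in seen:
            continue  # same content already merged from another source
        seen.add(key)
        unified[unit.category].append(unit)
    return dict(unified)


def sensitive_categories(unified):
    """List the categories that hold at least one sensitive unit."""
    return sorted(
        category
        for category, items in unified.items()
        if any(unit.sensitive for unit in items)
    )
```

Keeping the source and logical location on every unit means the merged result can still point back to where each item came from, whatever the original file format was.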
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings. The various features of the drawings are not to scale as the illustrations are for clarity in facilitating one skilled in the art in understanding the invention in conjunction with the detailed description. In the drawings:
FIG. 1 illustrates a networked computer environment according to at least one embodiment;
FIG. 2 provides an operational flowchart illustrating a method of document processing according to at least one embodiment;
FIG. 3 provides a block diagram providing a microservice architecture according to one environment;
FIG. 4 provides a more detailed illustration of the interconnections between the modules of FIG. 3;
FIG. 5 provides a block diagram of internal and external components of computers and servers depicted in FIG. 1 according to at least one embodiment;
FIG. 6 provides a block diagram of an illustrative cloud computing environment including the computer system depicted in FIG. 1, in accordance with one embodiment; and
FIG. 7 provides a block diagram of functional layers of the illustrative cloud computing environment of FIG. 6, in accordance with an embodiment.
DETAILED DESCRIPTION
Detailed embodiments of the claimed structures and methods are disclosed herein; however, it is to be understood that the disclosed embodiments are merely illustrative of the claimed structures and methods that may be embodied in various forms. This invention may, however, be embodied in many different forms and should not be construed as limited to the exemplary embodiments set forth herein.
Rather, these exemplary embodiments are provided so that this disclosure will be thorough and complete and will fully convey the scope of this invention to those skilled in the art. In the description, details of well-known features and techniques may be omitted to avoid unnecessarily obscuring the presented embodiments. The present invention may be a system, a method, and/or a computer program product at any possible technical detail level of integration. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention. The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the