EP-4740108-A1 - METHOD FOR INTEGRATING AND ANALYSING HETEROGENEOUS SOURCE DATA
Abstract
The present application relates to methods for integrating and analysing heterogeneous data (3) from heterogeneous source applications (2). It addresses the denormalization inherent to the data models native to web-based database platforms, each of which follows its own data configuration and therefore offers no single, standardized way of presenting data, hindering a structured aggregation of information. To achieve this goal, the present application describes a computer-implemented method comprising: executing a computational data collection routine (1) to access and load heterogeneous source data (3); providing a heterogeneous data integration module (4) for transforming heterogeneous source data (3) into target data (5); implementing a data analysis process (6) based at least on target data (5), to generate output data (7); and feeding a normalized database (8) with output data (7).
Inventors
- Ribeiro Bastos, Pedro Alexandre
- Amorim Alves de Araújo, Tiago Filipe
Assignees
- Bússola Diligente - Consultoria Lda
Dates
- Publication Date
- 20260513
- Application Date
- 20240628
Claims (14)
- 1. Computer-implemented method for integrating and analysing heterogeneous source data (3) obtained from a plurality of heterogeneous source applications (2), comprising the steps of:
  - executing a computational data collection routine (1) to access and load heterogeneous source data (3) from at least one heterogeneous source application (2);
  - providing a heterogeneous data integration module (4) with specifications for transforming heterogeneous source data (3) into target data (5);
  - implementing a data analysis process (6) based at least on target data (5), to generate output data (7);
  - feeding a normalized database (8) with output data (7);
  wherein providing a heterogeneous data integration module (4) with specifications comprises:
  - inputting a first level data structure specification describing an intermediate representation of data, and a second level data structure specification describing a target data representation;
  - implementing an extraction framework specification to parse the first level data structure specification to extract data from the heterogeneous source data (3) in order to generate an intermediate representation of data;
  - implementing a data normalization framework specification to transform the intermediate representation of data into target data (5).
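The claim above does not prescribe concrete data structures, so the following minimal sketch is purely illustrative: the specification dictionaries, record layouts, and function names (`collect`, `extract`, `normalize`, `analyse`) are hypothetical stand-ins for the collection routine (1), the extraction and normalization frameworks of the integration module (4), and the analysis process (6).

```python
# Hypothetical first-level spec: fields kept in the intermediate representation.
FIRST_LEVEL_SPEC = {"fields": ["name", "qty"]}
# Hypothetical second-level spec: mapping to the target data representation.
SECOND_LEVEL_SPEC = {"name": "item_name", "qty": "quantity"}

def collect(sources):
    """Computational data collection routine (1): load raw heterogeneous records."""
    return [record for source in sources for record in source]

def extract(raw, spec):
    """Extraction framework: parse the first-level spec, keep only listed fields."""
    return [{f: r.get(f) for f in spec["fields"]} for r in raw]

def normalize(intermediate, spec):
    """Normalization framework: rename fields to the target representation (5)."""
    return [{spec[k]: v for k, v in r.items()} for r in intermediate]

def analyse(target):
    """Data analysis process (6): here, a trivial aggregation as a placeholder."""
    return {"total_quantity": sum(r["quantity"] for r in target)}

# Two heterogeneous source applications (2) with differing record shapes.
source_a = [{"name": "bolt", "qty": 4, "colour": "grey"}]
source_b = [{"name": "nut", "qty": 6}]

raw = collect([source_a, source_b])
target = normalize(extract(raw, FIRST_LEVEL_SPEC), SECOND_LEVEL_SPEC)
output = analyse(target)  # output data (7), fed to the normalized database (8)
print(output)             # {'total_quantity': 10}
```

The point of the two-level split is that source variability is absorbed at the extraction stage, so the normalization stage only ever sees one intermediate shape.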
- 2. The method according to any of the previous claims, wherein the data normalization framework specification comprises executing Natural Language Processing computational routines configured to:
  - create a logical structure diagram from the first level data structure describing the intermediate representation of data, defined by a plurality of smaller data components;
  - execute an information extraction computational routine to detect and categorize essential information data;
  - parse the second level data structure specification to transform essential information data into target data (5).
- 3. The method according to claim 2, wherein the information extraction computational routine is a Named Entity Recognition routine.
- 4. The method according to claim 2 or 3, wherein the Natural Language Processing computational routines include implementing Natural Language Processing heuristics adapted to:
  - pre-process the text of the intermediate representation of data;
  - identify table information;
  - extract essential information data.
- 5. The method according to claim 4, wherein the heuristic to pre-process the text of the intermediate representation of data includes:
  - separating numbers or letters from unintended symbols or punctuation;
  - removing multiple spaces or paragraphs;
  - separating unintended preceding/following characters from each unit of measure symbol.
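The three pre-processing operations of claim 5 can be sketched with plain regular expressions. This is only one possible realisation: the symbol set and the list of unit-of-measure tokens below are assumptions, not part of the claim.

```python
import re

def preprocess(text):
    """Illustrative text pre-processing heuristic in the spirit of claim 5."""
    # 1. Separate numbers or letters from unintended symbols or punctuation
    #    (the symbol set #, *, @ is an assumed example).
    text = re.sub(r"(\w)([#*@])", r"\1 \2", text)
    # 2. Remove multiple spaces or paragraph breaks.
    text = re.sub(r"\s+", " ", text)
    # 3. Separate unintended adjacent characters from unit-of-measure symbols
    #    (kg, cm, mm, m used as assumed examples).
    text = re.sub(r"(\d)(kg|cm|mm|m)\b", r"\1 \2", text)
    return text.strip()

print(preprocess("Steel  rod 12mm*\n\nlength 3m"))
# Steel rod 12 mm * length 3 m
```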
- 6. The method according to claim 4 or 5, wherein the heuristic to identify table information is a Camelot heuristic adapted to:
  - verify column tables;
  - verify if the tables have more columns and rows than a predefined threshold;
  - obtain header keywords; and
  - extract table items.
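The verification logic of claim 6 can be illustrated independently of the PDF parsing itself. In the sketch below the table is assumed to be already parsed into a list of rows (in practice such rows could come from the Camelot library, e.g. via `camelot.read_pdf`); the thresholds and header keywords are hypothetical examples.

```python
def table_is_relevant(table, min_rows=2, min_cols=2,
                      header_keywords=("item", "quantity")):
    """Check a parsed table against claim 6's criteria.
    `table` is a list of rows (lists of cells); keyword set is an assumption."""
    # Verify the table exceeds the predefined row/column thresholds.
    if not table or len(table) < min_rows or len(table[0]) < min_cols:
        return False, []
    # Obtain header keywords from the first row.
    header = [str(cell).strip().lower() for cell in table[0]]
    if not any(keyword in header for keyword in header_keywords):
        return False, []
    # Extract table items (the body rows).
    return True, table[1:]

ok, items = table_is_relevant([["Item", "Quantity"],
                               ["bolt", "4"],
                               ["nut", "6"]])
print(ok, items)  # True [['bolt', '4'], ['nut', '6']]
```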
- 7. The method according to any of the claims 4 to 6, wherein extracting essential information data includes implementing the following sequence of algorithms:
  - Fuzzy matching;
  - Entity ruler extraction;
  - Item construction; and
  - Item verification.
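The sequence of claim 7 can be sketched with the standard library's `difflib` standing in for the fuzzy-matching step; the entity-ruler step (in practice a pattern-based component such as spaCy's `EntityRuler`) is omitted here for brevity. The vocabulary, thresholds, and record fields are all hypothetical.

```python
import difflib

# Assumed known vocabulary against which noisy tokens are matched.
KNOWN_ITEMS = ["stainless steel bolt", "hex nut", "flat washer"]

def fuzzy_match(token, cutoff=0.6):
    """Fuzzy matching: map a noisy token to a known vocabulary entry, or None."""
    hits = difflib.get_close_matches(token, KNOWN_ITEMS, n=1, cutoff=cutoff)
    return hits[0] if hits else None

def build_item(name, qty):
    """Item construction: assemble a candidate item record."""
    return {"name": name, "qty": qty}

def verify_item(item):
    """Item verification: reject incomplete or implausible items."""
    return item["name"] is not None and item["qty"] > 0

raw = [("stainles steel bolt", 4), ("unknown widget", 1)]  # note the typo
items = [build_item(fuzzy_match(name), qty) for name, qty in raw]
accepted = [item for item in items if verify_item(item)]
print(accepted)  # [{'name': 'stainless steel bolt', 'qty': 4}]
```

The misspelled token is recovered by fuzzy matching, while the unrecognised one fails verification and is dropped, which is the filtering role the construction/verification pair plays in the claimed sequence.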
- 8. The method according to any of the previous claims, wherein the computational data collection routine (1) is executed periodically.
- 9. The method according to any of the previous claims, wherein the computational data collection routine (1) is implemented via a File Transfer Protocol or a web scraping process.
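Claims 8 and 9 together describe a periodically executed collection routine whose transport may be FTP or web scraping. The sketch below keeps the periodic loop testable by injecting the fetch function; in a real deployment `fetch` could wrap, for example, `ftplib.FTP` retrieval or an HTML-scraping call (both assumptions, not mandated by the claims).

```python
import time

def run_periodically(fetch, interval_seconds, max_runs):
    """Computational data collection routine (1), executed periodically (claim 8).
    `fetch` is any zero-argument callable returning a list of collected items."""
    collected = []
    for _ in range(max_runs):
        collected.extend(fetch())          # e.g. FTP download or web scraping
        time.sleep(interval_seconds)       # wait until the next scheduled run
    return collected

# Simulated source yielding one batch of documents per run.
batches = iter([["invoice_001.pdf"], ["invoice_002.pdf"]])
data = run_periodically(lambda: next(batches), interval_seconds=0, max_runs=2)
print(data)  # ['invoice_001.pdf', 'invoice_002.pdf']
```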
- 10. The method according to any of the previous claims, wherein the data analysis process (6) is a pattern recognition algorithm.
- 11. The method according to any of the previous claims, further comprising:
  - providing a user interface platform (8) with input means for generating user profile data (9);
  - implementing the data analysis process based at least on target data (5) and on the user profile data (9), to generate output data (7).
- 12. The method according to claims 10 and 11, wherein user profile data (9) includes user-preference data, and wherein the data analysis process (6) is configured to identify patterns in target data matching the user-preference data, the output data being data patterns that match the user-preference data.
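A minimal reading of claim 12 is a filter over the target data driven by user preferences. The sketch below uses exact field matching as the "pattern" for simplicity; the field names and records are hypothetical, and a real implementation would use the pattern recognition algorithm of claim 10.

```python
def match_preferences(target_data, preferences):
    """Data analysis process (6) per claim 12: keep target records whose
    fields match the user-preference data (9). Field names are assumed."""
    return [record for record in target_data
            if all(record.get(key) == value for key, value in preferences.items())]

target = [{"category": "steel", "region": "PT"},
          {"category": "wood", "region": "PT"}]
prefs = {"category": "steel"}           # user-preference data from the profile
print(match_preferences(target, prefs)) # [{'category': 'steel', 'region': 'PT'}]
```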
- 13. The method according to any of the previous claims, wherein the data analysis process (6) relates to Business Intelligence algorithms.
- 14. The method according to any of the previous claims, further comprising:
  - accessing a translation service adapted to dynamically translate output data into a preconfigured language; optionally, the translation service is a cloud service.
Description
DESCRIPTION

METHOD FOR INTEGRATING AND ANALYSING HETEROGENEOUS SOURCE DATA

FIELD OF THE APPLICATION

The present application is enclosed in the field of integration of data from heterogeneous source applications. More specifically, the present application relates to methods for integrating and analysing heterogeneous data from heterogeneous source applications.

PRIOR ART

Heterogeneous data are any data with high variability of data types and formats. Data from distinct source applications are often heterogeneous because they stem from independent and disparate activities and are managed and maintained by different application owners. Therefore, these data often differ both in their values and in their structures, even when they relate to the same phenomena.

The ability to deal with data heterogeneity effectively and efficiently is of utmost importance for data integration systems and a prerequisite to several other applications. For instance, when focusing on the Web of data, it enables semantic search in terms of entities and relations on top of the Web of text, and deep reasoning using related ontologies, thus creating the Web of knowledge. Therefore, when dealing with multiple heterogeneous data source applications, the final aim is often to fuse the different manifestations of the same real-world entity into a unified view that gives users the illusion of interacting with one single data source.

Currently, databases used for design and engineering employ a variety of different data models, interface languages, naming conventions, data semantics, schemas, and data representations. Thus, a fundamental problem for concurrent engineering is the sharing of heterogeneous information among a variety of design resources. Successful concurrent engineering also requires access to data from multiple stages of the design life-cycle, but the diversity among data from different tools and at different stages creates serious barriers.
The present solution is intended to innovatively overcome such issues.

SUMMARY OF THE APPLICATION

It is therefore the object of the present application to provide a solution to the denormalization inherent to the data models native to web-based database platforms, each of which follows its own data configuration and therefore offers no single, standardized way of presenting data, hindering a structured aggregation of information.

To achieve this goal, the present application describes a computer-implemented method for integrating and analysing heterogeneous source data obtained from a plurality of heterogeneous source applications, which comprises the steps of:
- executing a computational data collection routine to access and load heterogeneous source data from at least one heterogeneous source application;
- providing a heterogeneous data integration module with specifications for transforming heterogeneous source data into target data;
- implementing a data analysis process based at least on target data, to generate output data;
- feeding a normalized database with output data.

More specifically, the step of providing a heterogeneous data integration module with specifications comprises:
- inputting a first level data structure specification describing an intermediate representation of data, and a second level data structure specification describing a target data representation;
- implementing an extraction framework specification to extract data from the heterogeneous source data in order to generate an intermediate representation of data;
- implementing a data normalization framework specification to transform the intermediate representation of data into target data.
The proposed solution uses and orchestrates the simultaneous operation of a set of innovative procedures based on, for example, Robotic Process Automation, Natural Language Processing and Artificial Intelligence mechanisms, in order to carry out the collection of data from the various heterogeneous source applications, their processing and analysis, as well as a set of advanced data analyses aimed at providing aggregated information to a user. The synchronized and automated operation of all these elements with the aim of producing knowledge encapsulates the innovative nature of the method described in the present application, and represents a technical challenge in the sense of putting all the elements in an effective and efficient dialogue with each other.

DESCRIPTION OF FIGURES

Figure 1 - a block diagram illustrating a first exemplary chain of data elements according to the method described in the present application. The reference signs represent:
1 - computational data collection routine;
2 - heterogeneous source application;
3 - heterogeneous source data;
4 - heterogeneous data integration module;
5 - target data;
6 - data analysis process;
7 - output data;
8 - normalized database.

Figure 2 - a block diagram illustrating a second exemplary chain of data elements according to the method describ