CN-121979914-A - Data lake-based data service generation method and device and readable storage medium
Abstract
The application discloses a data lake-based data service generation method, device, and readable storage medium. The method comprises: receiving a data processing request initiated by a user; parsing the data items in the request to generate data processing logic; converting the data processing logic into a target query representation for execution in the data lake; acquiring data from the data lake and generating a target data set by executing the target query representation; constructing a data lineage relationship for each field in the target data set; packaging the target data set into a data service with an access interface; and, based on the data lineage relationship, providing a lineage query function for the upstream data sources and downstream dependent entities of any field in the target data set. The technical scheme provided by the application can significantly improve the convenience of data query and the efficiency of service development.
Inventors
- LIU XIAOCHEN
- WANG YONGLIANG
Assignees
- 远景能源有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20251212
Claims (13)
- 1. A data service generation method based on a data lake, wherein data of a plurality of heterogeneous data sources are stored in the data lake, the method comprising: receiving a data processing request initiated by a user, wherein the data processing request comprises data items selected by the user from the plurality of heterogeneous data sources, and the data items comprise at least a data table and field information; parsing the data items to generate data processing logic, converting the data processing logic into a target query representation for execution in the data lake, acquiring data from the data lake and generating a target data set by executing the target query representation, and constructing a data lineage relationship for each field in the target data set; and packaging the target data set into a data service with an access interface, and providing, based on the data lineage relationship, a lineage query function for an upstream data source and a downstream dependent entity of any field in the target data set.
- 2. The method of claim 1, wherein parsing the data items to generate the data processing logic comprises: determining an association relationship between the data tables based on metadata of the data tables; and generating the data processing logic according to the association relationship and the field information, wherein the data processing logic comprises at least one of a data join, a data conversion, and a data aggregation.
- 3. The method of claim 2, wherein generating the data processing logic according to the association relationship and the field information comprises: constructing, according to the association relationship, a join operation for combining the plurality of data tables; and, based on the join operation, determining the fields characterized by the field information as output columns and associating them with corresponding conversion or aggregation operations to generate the data processing logic.
- 4. The method of claim 2, wherein converting the data processing logic into a target query representation for execution in the data lake comprises: parsing the data processing logic into a target syntax tree; and rewriting the target syntax tree into the target query representation based on the query specification supported by the data lake.
- 5. The method of claim 4, wherein acquiring data from the data lake and generating the target data set by executing the target query representation comprises: executing the target query representation and computing over data in the data lake to generate an intermediate result set; and storing the intermediate result set in the data lake to form the target data set.
- 6. The method of claim 5, wherein constructing the data lineage relationship for each field in the target data set comprises: determining the source field and the source data table on which each output field in the target data set depends, and recording the data conversion expression and the table association conditions between the output field and the source field; and storing the source field, the source data table, the data conversion expression, and the table association conditions as lineage information in a metadata management database, and binding the lineage information to the target data set.
- 7. The method of claim 6, wherein the lineage query function comprises: in response to a user's access request to the target data set through a query engine, parsing a query plan generated by the query engine; and tracing back and returning, based on the query plan and the lineage information stored in the metadata management database, the source data table and the source field corresponding to the accessed field in the target data set.
- 8. The method of claim 1, wherein prior to receiving the data processing request initiated by the user, the method further comprises: configuring connection parameters of the heterogeneous data sources, and extracting data from the heterogeneous data sources according to a preset acquisition strategy; and performing a quality check on the extracted data, and storing the data that passes the quality check in the data lake.
- 9. A data service generation apparatus based on a data lake, the data lake storing data of a plurality of heterogeneous data sources, the apparatus comprising: a data receiving module, configured to receive a data processing request initiated by a user, wherein the data processing request comprises data items selected by the user from the plurality of heterogeneous data sources, and the data items comprise at least a data table and field information; a data analysis module, configured to parse the data items to generate data processing logic, convert the data processing logic into a target query representation for execution in the data lake, acquire data from the data lake and generate a target data set by executing the target query representation, and construct a data lineage relationship for each field in the target data set; and a data release module, configured to package the target data set into a data service with an access interface and to provide, based on the data lineage relationship, a lineage query function for an upstream data source and a downstream dependent entity of any field in the target data set.
- 10. A data lake-based data service generating apparatus, comprising: a memory for storing a computer program; and a processor for executing the computer program stored in the memory to cause the apparatus to perform the method of any one of claims 1 to 8.
- 11. A computer readable storage medium having stored thereon instructions which, when executed by a processor, cause the processor to implement the method of any of claims 1 to 8.
- 12. A computer program product, characterized in that the computer program product comprises computer program code which, when run on a computer, causes the computer to carry out the method according to any one of claims 1 to 8.
- 13. A chip, characterized in that it comprises a circuit for performing the method according to any of claims 1 to 8.
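As a rough illustration of the flow in claims 2 to 4, the following Python sketch derives a join from table metadata and emits a query string for the user-selected fields. It is a minimal sketch under stated assumptions, not the claimed implementation: the `FieldItem` type, the `FOREIGN_KEYS` metadata layout, the `build_query` function, and the `devices` and `readings` tables are all hypothetical, and the sketch renders SQL text directly instead of building and rewriting a syntax tree as claim 4 describes.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class FieldItem:
    """One user-selected field; `aggregate` is optional (for example "AVG")."""
    table: str
    column: str
    aggregate: Optional[str] = None


# Hypothetical table metadata:
# (child_table, parent_table) -> (child_column, parent_column)
FOREIGN_KEYS: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("readings", "devices"): ("device_id", "id"),
}


def build_query(items: List[FieldItem],
                foreign_keys: Dict[Tuple[str, str], Tuple[str, str]]) -> str:
    """Turn selected tables and fields into a single SQL string."""
    if not items:
        raise ValueError("at least one field must be selected")

    # Collect the distinct tables in selection order.
    tables: List[str] = []
    for item in items:
        if item.table not in tables:
            tables.append(item.table)

    # Derive one join per additional table from the metadata.
    joined = [tables[0]]
    joins: List[str] = []
    for table in tables[1:]:
        for other in joined:
            if (table, other) in foreign_keys:
                col, other_col = foreign_keys[(table, other)]
            elif (other, table) in foreign_keys:
                other_col, col = foreign_keys[(other, table)]
            else:
                continue
            joins.append(f"JOIN {table} ON {table}.{col} = {other}.{other_col}")
            break
        else:
            raise ValueError(f"no association found for table {table!r}")
        joined.append(table)

    # Selected fields become output columns; aggregated ones get an alias.
    select: List[str] = []
    group_by: List[str] = []
    for item in items:
        ref = f"{item.table}.{item.column}"
        if item.aggregate:
            alias = f"{item.aggregate.lower()}_{item.column}"
            select.append(f"{item.aggregate}({ref}) AS {alias}")
        else:
            select.append(ref)
            group_by.append(ref)

    parts = [f"SELECT {', '.join(select)}", f"FROM {tables[0]}", *joins]
    # GROUP BY is only needed when plain and aggregated columns are mixed.
    if group_by and len(group_by) < len(select):
        parts.append(f"GROUP BY {', '.join(group_by)}")
    return " ".join(parts)


items = [
    FieldItem("devices", "name"),
    FieldItem("readings", "temperature", "AVG"),
]
sql = build_query(items, FOREIGN_KEYS)
print(sql)
```

A production system would more plausibly build a syntax tree and rewrite it for the dialect the data lake's engine supports, as claim 4 describes. Direct string rendering is used here only to keep the sketch short.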
Description
Data lake-based data service generation method and device and readable storage medium
Technical Field
The present application relates to the field of data management technologies, and in particular to a data lake-based data service generation method and device, and a readable storage medium.
Background
As enterprise digital transformation deepens, large manufacturing enterprises have gradually accumulated a large number of heterogeneous data systems and applications. Examples include product design software used in the development stage, sensor monitoring systems for production, condition monitoring tools for equipment operation and maintenance, and external data systems for connecting with suppliers. Because these systems were built at different times and serve different business scenarios, they differ markedly in technical architecture, data model, data format, and the like.
To integrate and coordinate multi-source heterogeneous data, the industry has developed various data management techniques. However, existing techniques have significant drawbacks. For example, most schemes focus only on the centralized storage of heterogeneous data and neglect restructuring the data and exposing it as services, so data preparation depends heavily on ETL scripts or SQL logic written by technicians, and business personnel find it difficult to take part directly in data processing and analysis. As another example, existing data lineage analysis generally stops at the table level and lacks fine-grained, field-level traceability, so the source of a problem cannot be located quickly and accurately when data is abnormal, which seriously reduces troubleshooting efficiency.
In view of this, a new data service generation method is needed to address the above drawbacks.
Disclosure of Invention
The application aims to provide a data lake-based data service generation method, device, and readable storage medium that can significantly improve the convenience of data query and the efficiency of service development.
To achieve this aim, the application provides a data service generation method based on a data lake, wherein the data lake stores data of a plurality of heterogeneous data sources. The method comprises: receiving a data processing request initiated by a user, wherein the data processing request comprises data items selected by the user from the plurality of heterogeneous data sources, and the data items comprise at least a data table and field information; parsing the data items to generate data processing logic, converting the data processing logic into a target query representation for execution in the data lake, acquiring data from the data lake and generating a target data set by executing the target query representation, and constructing a data lineage relationship for each field in the target data set; and packaging the target data set into a data service with an access interface, and providing, based on the data lineage relationship, a lineage query function for an upstream data source and a downstream dependent entity of any field in the target data set.
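The field-level lineage record and the upstream and downstream lookups described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: `LineageEdge`, `LineageStore`, and the `device_temperature` data set are hypothetical names, an in-memory index stands in for the metadata store, and downstream dependents are modelled simply as fields of derived data sets.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict, List, Tuple


@dataclass(frozen=True)
class LineageEdge:
    """Links one output field of a data set to one source field."""
    dataset: str
    output_field: str
    source_table: str
    source_field: str
    expression: str      # conversion expression that produces the output field
    join_condition: str  # table association condition used by the query


class LineageStore:
    """In-memory stand-in (an assumption) for a metadata store."""

    def __init__(self) -> None:
        self._by_output: DefaultDict[Tuple[str, str], List[LineageEdge]] = defaultdict(list)
        self._by_source: DefaultDict[Tuple[str, str], List[LineageEdge]] = defaultdict(list)

    def record(self, edge: LineageEdge) -> None:
        # Index each edge in both directions so either lookup is a dict access.
        self._by_output[(edge.dataset, edge.output_field)].append(edge)
        self._by_source[(edge.source_table, edge.source_field)].append(edge)

    def upstream(self, dataset: str, field: str) -> List[Tuple[str, str]]:
        """Source (table, field) pairs that an output field depends on."""
        edges = self._by_output.get((dataset, field), [])
        return [(e.source_table, e.source_field) for e in edges]

    def downstream(self, table: str, field: str) -> List[Tuple[str, str]]:
        """Derived (data set, field) pairs that depend on a source field."""
        edges = self._by_source.get((table, field), [])
        return [(e.dataset, e.output_field) for e in edges]


JOIN = "readings.device_id = devices.id"
store = LineageStore()
store.record(LineageEdge("device_temperature", "avg_temperature",
                         "readings", "temperature",
                         "AVG(readings.temperature)", JOIN))
store.record(LineageEdge("device_temperature", "name",
                         "devices", "name", "devices.name", JOIN))

print(store.upstream("device_temperature", "avg_temperature"))
print(store.downstream("devices", "name"))
```

Indexing edges in both directions trades a little memory for constant-time lookups either way, which matters when the lookup is used to locate the source of abnormal data.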
To achieve the above purpose, the application further provides a data service generation device based on a data lake, wherein the data lake stores data of a plurality of heterogeneous data sources. The device comprises a data receiving module, a data analysis module, and a data release module. The data receiving module is configured to receive a data processing request initiated by a user, wherein the data processing request comprises data items selected by the user from the plurality of heterogeneous data sources, and the data items comprise at least a data table and field information. The data analysis module is configured to parse the data items to generate data processing logic, convert the data processing logic into a target query representation for execution in the data lake, acquire data from the data lake and generate a target data set by executing the target query representation, and construct a data lineage relationship for each field in the target data set. The data release module is configured to package the target data set into a data service with an access interface and to provide, based on the data lineage relationship, a lineage query function for an upstream data source and a downstream dependent entity of any field in the target data set.
To achieve the above object, another aspect of the present application also provides a data lake-based data service generating apparatus including a memory for storing a computer program, and a processor for executing the computer program stored in the memory, so that the apparatus performs the data lake-based data service generation method described above.
To achieve the above object, another aspect of the present application also provides a computer-readab