CN-121979903-A - Hive QL generation method and system based on componentized data stream


Abstract

The invention discloses a Hive QL generation method and system based on a componentized data stream. The method comprises: constructing a data processing directed acyclic graph and serializing its topological structure to generate a structured graph data file; and compiling the structured graph data file to generate a structured Hive QL script. By constructing the data processing directed acyclic graph and serializing its topological structure, the invention allows complex data processing logic to be defined intuitively and persisted, provides accurate input for subsequent automatic compilation, and avoids at the source the logic errors that manually written code may introduce.

Inventors

ZHANG CHEN
JIANG TAO
SHAN JUNQUAN
WANG YUNZHE
CHEN JIE
HU MIN

Assignees

Nanjing LES Information Technology Co., Ltd. (南京莱斯信息技术股份有限公司)

Dates

Publication Date: 2026-05-05
Application Date: 2025-12-30

Claims (8)

1. A Hive QL generation method based on a componentized data stream, characterized by comprising the following steps: 1) constructing a data processing directed acyclic graph and serializing its topological structure to generate a structured graph data file; 2) compiling the structured graph data file to generate a structured Hive QL script.
2. The Hive QL generation method based on a componentized data stream according to claim 1, wherein step 1) specifically comprises: 11) constructing a data processing directed acyclic graph comprising a plurality of nodes, each node being marked with its type, the graph comprising at least one node of the data source type, at least one node of the intermediate processor type, and at least one node of the data writing type, the nodes being connected by directed edges that define the data-flow dependency relationships among the nodes; 12) encoding the node attributes and the connection relationships between nodes in the data processing directed acyclic graph into a JSON string according to a preset format to generate the structured graph data file.
3. The Hive QL generation method based on a componentized data stream according to claim 2, wherein step 11) specifically comprises: 111) for a node of the data source type, querying the Hive metadata service to obtain the data tables and field information selectable for the node, and completing the configuration of the node; 112) for a node of the data writing type, querying the Hive metadata service to obtain the structural information of the target table, and completing the configuration of the node.
4. The Hive QL generation method based on a componentized data stream according to claim 3, wherein step 2) specifically comprises: 21) parsing the structured graph data file, identifying all nodes and directed edges, and determining an execution sequence of the nodes based on the dependency relationships defined by the directed edges; 22) traversing the execution sequence to generate a common table expression for each node; 23) assembling all common table expressions with the data writing statements to generate the structured Hive QL script.
5. The Hive QL generation method based on a componentized data stream according to claim 4, wherein step 21) specifically comprises: 211) parsing the structured graph data file and identifying all nodes and the directed edges among the nodes; 212) performing a topological sort of the nodes based on the directed edges in the data processing directed acyclic graph to generate an execution sequence of all the nodes, wherein the execution sequence ensures that every node is executed after all of its upstream nodes.
6. The Hive QL generation method based on a componentized data stream according to claim 5, wherein step 22) specifically comprises: processing each node in a loop, in the order of the execution sequence determined in step 21), and executing the following steps in each iteration: 221) assigning to the node currently being processed a temporary result set identifier that is unique within the script, the assignment rule being that the English description of the node type is concatenated with an auto-incrementing serial number maintained by the system; 222) selecting the corresponding SQL syntax template from a predefined template library according to the type of the current node; 223) filling the temporary result set identifiers of all upstream nodes of the current node and the configuration parameters of the current node into the corresponding placeholders of the selected SQL syntax template to generate a complete sub-query; 224) combining the generated sub-query with the temporary result set identifier of the current node to form a common table expression in the format [temporary result set identifier] AS [sub-query]; 225) judging whether all nodes have been processed; if not, returning to step 221) to process the next node; if so, ending.
7. The Hive QL generation method based on a componentized data stream according to claim 6, wherein step 23) specifically comprises: 231) identifying, from the execution sequence of the nodes, all nodes of the data writing type according to the node types marked in step 11); 232) separating all the common table expressions generated in step 224) with commas, in the order of the execution sequence of the nodes, and adding the keyword WITH in front of them to constitute a complete common table expression definition part; 233) for each node identified in step 231) as being of the data writing type, performing the following operations: 2331) obtaining the target table name and partition information specified in the node configuration; 2332) generating a data writing statement in the format INSERT OVERWRITE TABLE [target table name] [partition information] SELECT [upstream temporary result set identifier], wherein the upstream temporary result set identifier is the identifier of the upstream node on which the data writing node directly depends in the execution sequence; 2333) combining the common table expression definition part formed in step 232) with the generated data writing statement to form the structured Hive QL script.
8. A Hive QL generation system based on a componentized data stream, comprising: a graph data file generation module, used for constructing a data processing directed acyclic graph and serializing its topological structure to generate a structured graph data file; and a Hive QL script generation module, used for compiling the structured graph data file to generate a structured Hive QL script.
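
The flow recited in claims 2 and 4 to 7 can be illustrated with a minimal Python sketch. It is not the patented implementation: the JSON key names, the node type names ("source", "filter", "sink"), the table names, and the template strings are all hypothetical, and the body of the write statement is assumed to be SELECT * FROM the upstream identifier. As a simplification, write nodes receive no common table expression of their own, each node uses at most one upstream identifier, and the WITH clause is repeated for every write statement.

```python
import json
from collections import deque

# Hypothetical graph data file (claim 2, step 12). The key names, node types
# and table names below are illustrative assumptions, not the patent's format.
GRAPH_JSON = json.dumps({
    "nodes": [
        {"id": "n1", "type": "source",
         "config": {"table": "ods.orders", "fields": "order_id, amount, dt"}},
        {"id": "n2", "type": "filter", "config": {"condition": "amount > 100"}},
        {"id": "n3", "type": "sink",
         "config": {"table": "dw.big_orders",
                    "partition": "PARTITION (dt='2025-01-01')"}},
    ],
    "edges": [{"from": "n1", "to": "n2"}, {"from": "n2", "to": "n3"}],
})

# Hypothetical template library keyed by node type (step 222).
TEMPLATES = {
    "source": "SELECT {fields} FROM {table}",
    "filter": "SELECT * FROM {upstream} WHERE {condition}",
}


def topological_order(nodes, edges):
    """Kahn's algorithm: each node is emitted after all of its upstream nodes."""
    indegree = {n["id"]: 0 for n in nodes}
    downstream = {n["id"]: [] for n in nodes}
    for e in edges:
        downstream[e["from"]].append(e["to"])
        indegree[e["to"]] += 1
    queue = deque(i for i, d in indegree.items() if d == 0)
    order = []
    while queue:
        cur = queue.popleft()
        order.append(cur)
        for nxt in downstream[cur]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")
    return order


def compile_graph(graph_json):
    """Compile the serialized graph into a WITH ... INSERT OVERWRITE script."""
    graph = json.loads(graph_json)
    nodes = {n["id"]: n for n in graph["nodes"]}
    upstream = {i: [] for i in nodes}
    for e in graph["edges"]:
        upstream[e["to"]].append(e["from"])

    aliases, ctes, inserts = {}, [], []
    counter = 0  # auto-incrementing serial number (step 221)
    for node_id in topological_order(graph["nodes"], graph["edges"]):
        node, cfg = nodes[node_id], nodes[node_id]["config"]
        if node["type"] == "sink":
            # Write node: emit the data writing statement (steps 2331-2332).
            src = aliases[upstream[node_id][0]]
            inserts.append(
                f"INSERT OVERWRITE TABLE {cfg['table']} {cfg['partition']}\n"
                f"SELECT * FROM {src}"
            )
            continue
        counter += 1
        alias = f"{node['type']}_{counter}"  # type name + serial number
        aliases[node_id] = alias
        params = dict(cfg)
        if upstream[node_id]:
            params["upstream"] = aliases[upstream[node_id][0]]
        subquery = TEMPLATES[node["type"]].format(**params)  # step 223
        ctes.append(f"{alias} AS (\n  {subquery}\n)")  # step 224
    with_clause = "WITH " + ",\n".join(ctes)  # step 232
    return "\n\n".join(f"{with_clause}\n{stmt}" for stmt in inserts)


if __name__ == "__main__":
    print(compile_graph(GRAPH_JSON))
```

For the three-node example graph, the sketch yields a single statement in which the common table expressions source_1 and filter_2 precede one INSERT OVERWRITE TABLE statement that selects from filter_2. A join or aggregation node would need a further template and support for several upstream identifiers.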

Description

Hive QL generation method and system based on componentized data stream

Technical Field

The invention belongs to the technical field of big data processing, and particularly relates to a Hive QL generation method and system based on a componentized data stream.

Background

Currently, the big data field offers several visualization tools and platforms that simplify data query and processing. Their implementation schemes can be summarized as follows:

Form-based query builders: such tools provide a graphical interface in which users select tables and fields and configure filtering and aggregation conditions through form elements such as drop-down menus and input boxes. The system then splices together the corresponding Hive QL query statement from the user's configuration.

Node-and-wire data flow designers: such tools allow a user to build a data processing flow (i.e., a directed acyclic graph, DAG) by dragging and connecting preset "processor" nodes (e.g., data sources, filters, joins, aggregations). The system parses the graph structure and converts it into an executable Hive QL script.

Although the above prior art, in particular the node-and-wire solutions, simplifies operation to some extent, it has inherent drawbacks that prevent it from being applied efficiently and reliably in industrial-scale big data platforms:

1. Weak metadata support and error-prone configuration. Most tools are disconnected from the Hive metadata service (Metastore) of the big data platform. When configuring data sources and fields, users still have to type names manually or pick from a large number of tables from memory, which is tedious and highly error-prone. The tools cannot automatically acquire and use platform data assets such as table structures, field types, and Chinese comments.

2. Chaotic script structure and poor maintainability. The SQL scripts generated by existing visualization tools are often poorly structured. Common output forms include:

A single, complex, multi-level nested query: all logic is compressed into one large, deeply nested SELECT statement, which is extremely hard to read.

A series of discrete temporary table operations: each step is generated as a CREATE TABLE AS SELECT operation on a temporary table, so the script is lengthy and intermediate results are complicated to manage.

Neither structure intuitively reflects the business logic lineage constructed by the user, so the scripts are difficult to read, debug, and maintain. This creates a "script maintenance prison" and greatly complicates subsequent operation and iteration.

3. Lack of deep integration with the big data platform. Many tools are functionally isolated and are not deeply integrated, as native components, with the platform's data asset management, unified permission management, and job scheduling systems. As a result, they cannot form an end-to-end closed-loop solution from data definition through processing to delivery, which limits their practical value in enterprise production environments.

Disclosure of Invention

In view of the above defects of the prior art, the invention aims to provide a Hive QL generation method and system based on a componentized data stream, so as to solve the core technical problems of the prior art: tedious and error-prone configuration caused by disconnected metadata, poor script maintainability caused by chaotic script structure, and the inability to deliver end to end caused by low platform integration.
In order to achieve the above purpose, the invention adopts the following technical scheme:

The invention discloses a Hive QL generation method based on a componentized data stream, which comprises the following steps: 1) constructing a data processing directed acyclic graph and serializing its topological structure to generate a structured graph data file; 2) compiling the structured graph data file to generate a structured Hive QL script.

Further, step 1) specifically includes: 11) constructing a data processing directed acyclic graph comprising a plurality of nodes, each node being marked with its type, the graph comprising at least one node of the data source type, at least one node of the intermediate processor type, and at least one node of the data writing type, the nodes being connected by directed edges that define the data-flow dependency relationships among the nodes; 12) encoding the node attributes and the connection relationships between nodes in the data processing directed acyclic graph into a JSON string according to a preset format to generate the structured graph data file.

Further, step 11) specifically includes: 111) For the node with the type of the data source, i