CN-121143771-B - Automatic data pipeline generation method, equipment and medium based on low-code platform


Abstract

The invention discloses an automated data pipeline generation method, equipment and medium based on a low-code platform, and belongs to the technical field of data processing. The method comprises: parsing business requirement text by natural language processing, extracting structured requirement information, and converting it into a graphical pipeline model; selecting the best-matching basic template from a template library with a semantic-driven template matching algorithm; generating an intermediate template by means of an automatic data type mapping mechanism; and generating target code through a large-model-based conversion rule automatic generation engine and deploying that code. The invention automates the full flow from business requirement to data pipeline, reduces dependence on specialist developers, improves the efficiency and adaptability of data pipeline construction, and can be widely applied to heterogeneous data source integration scenarios.

Inventors

KANG ZHEN
HU LUNLIANG
GAO JU
DAI YAFEI
WU ZHENGANG
LONG JUNJIE
JIA HONGLIANG

Assignees

中建材信息技术股份有限公司
中建材信云智联科技有限公司
中建材信云智联科技有限公司北京分公司
中建材信云智联科技(北京)有限公司

Dates

Publication Date: 2026-05-12
Application Date: 2025-08-20

Claims (8)

1. An automated data pipeline generation method based on a low-code platform, characterized by comprising the following steps:
S1, parsing a business requirement text input by a user through natural language processing, and extracting source-end information, target-end information, data processing rules and flow constraints of a data pipeline to form structured requirement information;
S2, converting the parsed requirement information into a graphical pipeline model based on a preset visual data pipeline modeling language, wherein the visual data pipeline modeling language comprises custom data stream node symbols, data conversion operator symbols and node connection rules;
S3, selecting, from a template library, the basic template with the highest matching degree to the graphical pipeline model by a semantic-driven pipeline template matching algorithm, wherein the matching algorithm produces a comprehensive score by calculating the cosine similarity between a requirement semantic vector and a template semantic vector and combining it with the matching degree of data processing node types;
S4, based on an automatic data type mapping mechanism, matching and converting the source-end data types and the target-end data types in the basic template, and generating an intermediate template containing type conversion logic;
S5, generating target code from the intermediate template and the special processing rules in the business requirement through a large-model-based conversion rule automatic generation engine, and deploying the target code to a designated running environment to complete automatic generation of the data pipeline;
wherein the visual data pipeline modeling language further comprises: dynamic arrow symbols representing real-time data streams and static arrow symbols representing batch data streams; and operator attribute panels for marking data cleaning, data filtering and data aggregation operations, the attribute panels allowing a user to configure operator parameters by drop-down selection or by parameter input;
and wherein the semantic-driven pipeline template matching algorithm specifically comprises:
semantically encoding the graphical pipeline model to generate a requirement semantic vector containing a node type sequence, data flow characteristics and constraint condition characteristics;
semantically encoding each basic template in the template library to generate a template semantic vector;
calculating the cosine similarity between the requirement semantic vector and each template semantic vector, and at the same time counting the proportion of nodes of the same type in the graphical pipeline model and the basic template to obtain a node matching degree; and
computing a weighted sum of the cosine similarity and the node matching degree according to preset weights, and selecting the basic template with the highest score as the matching result.
2. The method according to claim 1, wherein parsing the business requirement text input by the user through natural language processing specifically comprises: performing word segmentation, entity recognition and relation extraction on the business requirement text with a pre-trained language model, and identifying the source system name, the target system name, the data entities, the data fields and the conversion relations among the fields of the data pipeline; and determining, through an intention recognition model, the user's requirements for the data pipeline regarding run frequency, fault tolerance and data consistency.
3. The method of claim 1, wherein the automatic data type mapping mechanism comprises: establishing a cross-database type mapping table, wherein the mapping table contains the correspondences among data types in relational databases, non-relational databases and file formats; when the source-end data type and the target-end data type have a direct mapping relation, directly applying the conversion rule in the mapping table; and when the source-end data type and the target-end data type have no direct mapping relation, generating adapter code containing a data type conversion function, wherein the conversion function performs type conversion according to the data precision requirement and the business rules.
4. The method of claim 1, wherein generating target code through the large-model-based conversion rule automatic generation engine comprises: inputting the intermediate template, the special processing rules in the business requirement and the target running environment information into a pre-trained code generation large model to generate initial target code; performing syntax checking and logic consistency checking on the initial target code; and, if errors exist, feeding the error information back to the large model for regeneration until correct target code is generated.
5. The method of claim 1, further comprising: visually displaying the generated data pipeline; allowing the user to adjust the data pipeline by dragging nodes and modifying node attributes; and, in response to the user's adjustment operations, synchronously updating the graphical model and the corresponding target code of the data pipeline in real time.
6. The method of claim 1, wherein the target code comprises data extraction code, data conversion code, data loading code and monitoring code, the monitoring code being configured to collect, in real time, running state indicators of the data pipeline, the running state indicators comprising data throughput, conversion success rate and delay time.
7. An electronic device, comprising: a processor; and a memory arranged to store computer-executable instructions that, when executed, cause the processor to perform the steps of the automated data pipeline generation method based on a low-code platform of any one of claims 1 to 6.
8. A storage medium storing computer-executable instructions which, when executed, implement the steps of the automated data pipeline generation method based on a low-code platform of any one of claims 1 to 6.
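
The weighted scoring recited in claim 1 can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the function names, the example weights (0.7 and 0.3), the use of plain lists as semantic vectors, and the choice to divide the shared node count by the larger of the two node counts (the claims do not fix the denominator).

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an all-zero vector counts as "no similarity"
    return dot / (norm_a * norm_b)


def node_match_ratio(model_nodes, template_nodes):
    """Proportion of same-type nodes shared by the model and a template.

    Assumption: shared nodes are counted as a multiset overlap and divided
    by the larger node count; the source does not specify the denominator.
    """
    larger = max(len(model_nodes), len(template_nodes))
    if larger == 0:
        return 0.0
    shared = Counter(model_nodes) & Counter(template_nodes)
    return sum(shared.values()) / larger


def select_template(model_vec, model_nodes, templates, w_cos=0.7, w_node=0.3):
    """Return (name, score) of the template with the highest weighted score.

    `templates` maps a hypothetical template name to a pair
    (semantic_vector, node_type_list). The default weights are
    illustrative stand-ins for the "preset weights" of the claim.
    """
    best_name, best_score = None, float("-inf")
    for name, (vec, nodes) in templates.items():
        score = (w_cos * cosine_similarity(model_vec, vec)
                 + w_node * node_match_ratio(model_nodes, nodes))
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score
```

In practice the semantic vectors would come from an encoder applied to the graphical pipeline model and to each template; the sketch takes them as given so that only the scoring and selection step is shown.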

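The type mapping mechanism of claim 3 (direct table lookup first, generated adapter otherwise) might look roughly like the following Python sketch. The table entries, the `Conversion` record, the function names and the numeric-only fallback are all hypothetical; a real system would hold a much larger table and choose an adapter per type pair according to the precision requirement and business rules.

```python
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP
from typing import Any, Callable

# Illustrative entries only: (source system, source type, target system) -> target type.
DIRECT_TYPE_MAP = {
    ("mysql", "DATETIME", "postgresql"): "TIMESTAMP",
    ("mysql", "TINYINT", "postgresql"): "SMALLINT",
}


@dataclass
class Conversion:
    target_type: str               # type name to use at the target end
    convert: Callable[[Any], Any]  # value-level conversion function
    direct: bool                   # True if the rule came from the mapping table


def make_decimal_adapter(scale: int) -> Callable[[Any], Decimal]:
    """Build a conversion function that rounds values to `scale` decimal places."""
    quantum = Decimal(1).scaleb(-scale)  # scale=2 gives Decimal('0.01')

    def convert(value: Any) -> Decimal:
        # Go through str() so binary float noise does not leak into the Decimal.
        return Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP)

    return convert


def resolve_conversion(src: str, src_type: str, dst: str, scale: int = 2) -> Conversion:
    """Use the mapping table when a direct rule exists, else fall back to an adapter."""
    key = (src.lower(), src_type.upper(), dst.lower())
    target = DIRECT_TYPE_MAP.get(key)
    if target is not None:
        # Assumption: for direct mappings the value itself needs no rewriting.
        return Conversion(target, lambda value: value, True)
    # Assumption: this sketch only covers numeric sources in the fallback path.
    return Conversion(f"NUMERIC(18,{scale})", make_decimal_adapter(scale), False)
```

For example, `resolve_conversion("mysql", "datetime", "postgresql")` takes the table path, while a pair with no table entry, such as a hypothetical `("mongodb", "double", "postgresql")`, receives an adapter whose rounding is controlled by `scale`.
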
Description

Automatic data pipeline generation method, equipment and medium based on low-code platform

Technical Field

The present document relates to the field of data processing technologies, and in particular to a method, equipment, and a medium for automated data pipeline generation based on a low-code platform.

Background

Currently, enterprise data integration relies mainly on professional developers manually writing ETL code to construct data pipelines; development efficiency is low and maintenance cost is high. Although some visual orchestration tools and low-code platforms exist on the market, a number of technical bottlenecks remain. Existing solutions understand requirements poorly: they struggle to parse complex business rules described in natural language accurately, and they are especially weak at recognizing implicit processing logic and special constraints. Traditional template matching relies mainly on simple label classification and cannot match at the semantic level, so template reuse is unsatisfactory.

At the implementation level, the type conversion mechanisms of existing tools are rigid and cope poorly with non-standard data formats. Generated code often lacks thorough exception handling and cannot meet the stability requirements of production environments. Operations monitoring is also weak and is mostly limited to basic run-state monitoring. As enterprise data environments grow more complex, traditional development approaches have difficulty meeting business agility requirements.
A scheme that automates the full flow from business requirements to a running pipeline is therefore needed to overcome the shortcomings of the prior art in intelligent parsing, semantic matching, code generation and related aspects.

Disclosure of Invention

Embodiments of the invention provide an automated data pipeline generation method, equipment and medium based on a low-code platform, with the aim of solving the problems described above.

According to an embodiment of the present invention, there is provided an automated data pipeline generation method based on a low-code platform, including:
S1, parsing a business requirement text input by a user through natural language processing, and extracting source-end information, target-end information, data processing rules and flow constraints of a data pipeline to form structured requirement information;
S2, converting the parsed requirement information into a graphical pipeline model based on a preset visual data pipeline modeling language, wherein the visual data pipeline modeling language comprises custom data stream node symbols, data conversion operator symbols and node connection rules;
S3, selecting, from a template library, the basic template with the highest matching degree to the graphical pipeline model by a semantic-driven pipeline template matching algorithm, wherein the matching algorithm produces a comprehensive score by calculating the cosine similarity between a requirement semantic vector and a template semantic vector and combining it with the matching degree of data processing node types;
S4, based on an automatic data type mapping mechanism, matching and converting the source-end data types and the target-end data types in the basic template, and generating an intermediate template containing type conversion logic;
S5, generating target code from the intermediate template and the special processing rules in the business requirement through a large-model-based conversion rule automatic generation engine, and deploying the target code to a specified running environment to complete automatic generation of the data pipeline.

According to an embodiment of the present invention, there is provided an electronic device including: a processor; and a memory arranged to store computer-executable instructions that, when executed, cause the processor to perform the steps of the automated data pipeline generation method based on a low-code platform as described above.

According to an embodiment of the present invention, there is provided a storage medium storing computer-executable instructions which, when executed, implement the steps of the automated data pipeline generation method based on a low-code platform as described above.

By adopting the embodiment of the invention, natural language processing is deeply combined with visual modeling, so that business personnel can describe data processing requirements directly in natural language and the system automatically converts those requirements into an executable data pipeline, thereby effectively reducing the technical th