CN-121996675-A - Text-to-SQL generation method and device based on enhanced atlas embedding technology

Abstract

The invention discloses a text-to-SQL generation method and device based on an enhanced atlas embedding technology. The method comprises: extracting entities such as tables, fields and condition values from a natural language query while acquiring the database schema and business constraints; constructing a multi-relation heterogeneous knowledge graph with four types of nodes (tables, fields, values and constraints), establishing semantic edges and injecting attributes; performing enhanced graph embedding, in which sparse embedding vectors are obtained through pre-trained word embedding, a KG-Adapter, a self-attention mechanism and L2 regularization with pruning; and aligning the embedding vectors with the text vector across modalities and inputting them into a fine-tuned large language model (LLM) to generate compliant SQL. The invention can effectively parse natural language queries by using database structure information, improves the accuracy and efficiency of SQL generation, and provides reliable and intelligent support for complex database queries.
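
The four node types and the semantic edges named in the abstract can be pictured with a small in-memory structure. The following is only a minimal sketch, not the patented implementation: the class names `Node` and `HeteroGraph`, the helper `build_example`, and the sample table `orders` with its field, value and constraint are all hypothetical, and the 'field-function' edge is left out for brevity.

```python
from dataclasses import dataclass, field

# The four node types named in the abstract.
NODE_TYPES = {"table", "field", "value", "constraint"}


@dataclass
class Node:
    node_id: str
    node_type: str                      # one of NODE_TYPES
    attrs: dict = field(default_factory=dict)


@dataclass
class HeteroGraph:
    nodes: dict = field(default_factory=dict)   # node_id -> Node
    edges: list = field(default_factory=list)   # (src, relation, dst) triples

    def add_node(self, node_id, node_type, **attrs):
        """Add a typed node and inject its attributes."""
        if node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {node_type}")
        self.nodes[node_id] = Node(node_id, node_type, attrs)

    def add_edge(self, src, relation, dst):
        """Add a labelled semantic edge between two existing nodes."""
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges.append((src, relation, dst))

    def neighbors(self, node_id, relation):
        """Return the targets reachable from node_id over one relation."""
        return [d for s, r, d in self.edges if s == node_id and r == relation]


def build_example():
    """Build a tiny graph for a hypothetical 'orders' table."""
    g = HeteroGraph()
    g.add_node("orders", "table", database="shop")
    g.add_node("orders.amount", "field",
               data_type="DECIMAL", non_null=True, unique=False)
    g.add_node("value:100", "value", data_type="DECIMAL")
    g.add_node("amount>=0", "constraint", constraint_type="value_range")
    g.add_edge("orders", "table-field", "orders.amount")
    g.add_edge("orders.amount", "field-value", "value:100")
    g.add_edge("orders.amount", "field-constraint", "amount>=0")
    return g
```

Storing edges as labelled triples keeps the relation type explicit, which is one simple way to represent a multi-relation graph; a graph neural network library would typically use per-relation adjacency structures instead.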

Inventors

  • YU LANG
  • YANG TIANHU
  • ZHANG LEI
  • Su Bufa

Assignees

  • 厦门市美亚柏科信息安全研究所有限公司

Dates

Publication Date
2026-05-08
Application Date
2025-12-16

Claims (14)

  1. A text-to-SQL generation method based on an enhanced atlas embedding technology, characterized by comprising the following steps: S1, entity extraction: extracting the table names, fields, condition values and functions in a natural language query into a natural-language-side structured entity subset through text preprocessing, named entity recognition, relation extraction and standardization; obtaining database schema information and business constraints to form a database-side structured entity subset; and merging the natural-language-side and database-side structured entity subsets to obtain a complete structured entity set; S2, multi-relation heterogeneous knowledge graph construction: mapping the entity set into table nodes, field nodes, value nodes and constraint nodes, constructing 'table-field', 'field-value', 'field-function' and 'field-constraint' semantic edges, and injecting data type, value range and business rule attributes into the nodes/edges to obtain the multi-relation heterogeneous knowledge graph; S3, enhanced graph embedding: mapping the graph nodes into initial vectors through pre-trained word embedding, updating the node representations through a KG-Adapter layer and a graph neural network, enhancing the features by combining a self-attention mechanism with a multi-layer perceptron, and obtaining sparse graph embedding vectors through L2 regularization and pruning; and S4, cross-modal alignment and SQL generation: aligning the sparse graph embedding vectors with the text vector of the natural language query to form a cross-modal joint representation, and inputting it into a supervised fine-tuned LLM to generate an SQL statement that conforms to the SQL grammar specification and the database business constraints.
  2. The text-to-SQL generation method according to claim 1, wherein the text preprocessing in step S1 includes word segmentation, part-of-speech tagging and stop-word removal; the preprocessing is oriented toward preserving key database query information, so that the extracted entity subset directly matches the database query requirement; specifically: word segmentation splits the natural language query text into semantic units; part-of-speech tagging retains key verbs, nouns and numerals; and stop-word removal discards words that have no database semantic association, such as 'in' and 'and'.
  3. The text-to-SQL generation method according to claim 1, wherein the named entity recognition in step S1 targets entity types related to database queries, the entity types including time entities, numeric entities, person entities, place name entities and function entities, and each recognized entity is output as a mapping pair in the format {<entity keyword>: <database mapping object>}.
  4. The text-to-SQL generation method according to claim 1, wherein the database schema information in step S1 includes the table structures, field data types, primary/foreign key associations and index information of the database, and the business constraints include field value range constraints, table association constraints and business logic constraints.
  5. The text-to-SQL generation method according to claim 1, wherein injecting data types into the nodes in step S2 specifically comprises: injecting into each table node the identifier of the database it belongs to; injecting into each field node its data type and its 'non-null' and 'unique' attributes; injecting into each value node its data type; and injecting into each constraint node its constraint type.
  6. The text-to-SQL generation method according to claim 1, wherein the value range in step S2 is obtained through normalization; specifically, numeric value nodes are mapped to the [0,1] interval by min-max normalization, and time-type value nodes are converted into timestamps and then normalized, so that the value ranges are consistent.
  7. The text-to-SQL generation method according to claim 1, wherein the pre-trained word embedding in step S3 is implemented with a pre-trained large language model (LLM), the pre-trained model being a BERT-series, Qwen-series or GPT-series model, so as to ensure that the dimension of the initial node vectors is consistent with the dimension of the LLM's text vectors.
  8. The text-to-SQL generation method according to claim 1, wherein in step S3 the KG-Adapter layer integrates the output of the graph neural network by prompt tuning; specifically, the node structure features generated by the graph neural network are used as a prompt embedding, concatenated with the initial vectors from the pre-trained word embedding, and input into the KG-Adapter layer, where 1-2 layers of linear transformation fuse the structural features with the semantic features.
  9. The text-to-SQL generation method according to claim 1, wherein the L2 regularization in step S3 adds a sum-of-squares term over the embedding vectors to the loss function, expressed as Loss = Loss_base + L_reg, where Loss_base is the base task loss, L_reg is the regularization term in the loss function, λ is the regularization coefficient, and v_i denotes the i-th embedding vector.
  10. The text-to-SQL generation method according to claim 1, wherein the pruning in step S3 uses a threshold screening mechanism, specifically comprising: presetting a weight threshold for vector dimensions, computing the absolute value of the weight of each dimension of each embedding vector, eliminating the dimensions whose absolute weight is below the threshold, and retaining the key semantic dimensions so as to reduce the dimension of the sparsified embedding vector; the threshold is preset as γ = 1e-4, and a dimension is pruned if |v_i| < γ = 1e-4, wherein the value range is 0.01-0.05.
  11. The text-to-SQL generation method according to claim 1, wherein the training samples for supervised fine-tuning in step S4 comprise two types of labeled data: the first type is pairs of a natural language query and graph answer node IDs, used to optimize the LLM's ability to semantically associate queries with graph nodes; the second type is pairs of a natural language query and a standard SQL statement, used to optimize the LLM's ability to generate SQL syntax; the fine-tuning uses a cross-entropy loss function, with 50-100 iteration rounds and a learning rate of 1e-5 e-5.
  12. A text-to-SQL generation apparatus based on an enhanced atlas embedding technology, configured to implement the text-to-SQL generation method according to any one of claims 1 to 11, and comprising: an entity extraction module, used for extracting the table names, fields, condition values and functions in a natural language query into a natural-language-side structured entity subset through text preprocessing, named entity recognition, relation extraction and standardization, obtaining database schema information and business constraints to form a database-side structured entity subset, and merging the natural-language-side and database-side structured entity subsets to obtain a complete structured entity set; a knowledge graph construction module, used for mapping the entity set into table nodes, field nodes, value nodes and constraint nodes, constructing 'table-field', 'field-value', 'field-function' and 'field-constraint' semantic edges, and injecting data type, value range and business rule attributes into the nodes/edges to obtain a multi-relation heterogeneous knowledge graph; an enhanced graph embedding module, used for mapping the graph nodes into initial vectors through pre-trained word embedding, updating the node representations through a KG-Adapter layer and a graph neural network, enhancing the features by combining a self-attention mechanism with a multi-layer perceptron, and obtaining sparse graph embedding vectors through L2 regularization and pruning; and a model fine-tuning module, used for aligning the sparse graph embedding vectors with the text vector of the natural language query to form a cross-modal joint representation, and inputting it into a supervised fine-tuned LLM to generate an SQL statement that conforms to the SQL grammar specification and the database business constraints.
  13. An electronic device, comprising: one or more processors; and a storage means for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1 to 11.
  14. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1 to 11.
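
The numeric steps in claims 6, 9 and 10 (min-max normalization, the L2-regularized loss, and threshold pruning) can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the claimed implementation: the function names are hypothetical; the exact form of L_reg is assumed to be λ times the sum of squared entries of all embedding vectors, since the claim text gives it only in words; a constant column is assumed to normalize to zeros; and pruned dimensions are zeroed rather than physically removed so that vectors stay aligned.

```python
def min_max(values):
    """Map numeric values to [0, 1] by min-max normalization (claim 6).

    Assumption: a constant column maps to all zeros to avoid dividing by zero.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def l2_penalty(embeddings, lam):
    """Assumed form of L_reg: lam * sum of squared entries of every v_i."""
    return lam * sum(x * x for vec in embeddings for x in vec)


def total_loss(base_loss, embeddings, lam):
    """Loss = Loss_base + L_reg, as stated in claim 9."""
    return base_loss + l2_penalty(embeddings, lam)


def prune(vec, gamma=1e-4):
    """Zero every dimension whose absolute value is below gamma (claim 10).

    Assumption: zeroing stands in for eliminating the dimension.
    """
    return [x if abs(x) >= gamma else 0.0 for x in vec]
```

For instance, `prune([0.5, 5e-5, -2e-4, -1e-5])` keeps the first and third entries and zeroes the other two, because only those two fall below the 1e-4 threshold in absolute value.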

Description

Text-to-SQL generation method and device based on enhanced atlas embedding technology

Technical Field

The invention belongs to the intersection of natural language processing (NLP) and intelligent database querying, and particularly relates to a text-to-SQL (structured query language) generation method and device based on an enhanced graph embedding technology.

Background

Text-to-SQL technology is a core bridge between natural language and databases; it aims to let non-technical users query a database directly in natural language without mastering SQL syntax. In recent years, methods such as KG-SQL and GraPPa have, by jointly learning database structure and natural language representations, simplified data querying from complex SQL to a more intuitive natural language form. This innovation makes data query and analysis user-friendly, democratizes access to database systems, improves data processing efficiency and widens its range of applications.

Current mainstream text-to-SQL methods mainly have the following defects:

(1) They depend on sequence-to-sequence models that use only text information and ignore the relationships between database table structures and fields. In addition, although some models introduce database schema information, they rely mainly on surface features such as table names and field names and cannot deeply mine the semantic associations and business logic constraints implicit between tables.

(2) Structural features such as foreign key relationships and value range constraints are modeled incompletely. Because graph structure information is underused, the model's semantic understanding of the database is insufficient and the semantics of foreign key constraints are ignored.

Therefore, a text-to-SQL method is needed that deeply fuses database structure semantics with an LLM, reduces hallucinations and adapts to complex scenarios.

Disclosure of Invention

To address the defects of the prior art, the invention provides a text-to-SQL generation method and device based on an enhanced atlas embedding technology. Through a four-stage flow of entity extraction, graph modeling, enhanced embedding and cross-modal generation, the database semantics are converted into a structured graph, the graph is embedded into a vector space, and after alignment with the text vector the LLM is guided to generate compliant SQL.

In a first aspect, the present invention provides a text-to-SQL generation method based on an enhanced atlas embedding technology, the method comprising the following steps:

S1, entity extraction: extracting the table names, fields, condition values and functions in a natural language query into a natural-language-side structured entity subset through text preprocessing, named entity recognition, relation extraction and standardization; obtaining database schema information and business constraints to form a database-side structured entity subset; and merging the natural-language-side and database-side structured entity subsets to obtain a complete structured entity set;

S2, multi-relation heterogeneous knowledge graph construction: mapping the entity set into table nodes, field nodes, value nodes and constraint nodes, constructing 'table-field', 'field-value', 'field-function' and 'field-constraint' semantic edges, and injecting data type, value range and business rule attributes into the nodes/edges to obtain the multi-relation heterogeneous knowledge graph;

S3, enhanced graph embedding: mapping the graph nodes into initial vectors through pre-trained word embedding, updating the node representations through a KG-Adapter layer and a graph neural network, enhancing the features by combining a self-attention mechanism with a multi-layer perceptron, and obtaining sparse graph embedding vectors through L2 regularization and pruning; and

S4, cross-modal alignment and SQL generation: aligning the sparse graph embedding vectors with the text vector of the natural language query to form a cross-modal joint representation, and inputting it into a supervised fine-tuned LLM to generate an SQL statement that conforms to the SQL grammar specification and the database business constraints.

Preferably, the text preprocessing in step S1 comprises word segmentation, part-of-speech tagging and stop-word removal, so that the extracted entity subset directly matches the database query requirement; specifically, word segmentation splits the natural language query text into semantic units, and part-of-speech tagging preserves the parts of speech of key verbs, nouns and numerical value parts of spee