CN-122019601-A - Data query method and system based on semantic recognition


Abstract

The invention discloses a data query method and system based on semantic recognition, and relates to the technical field of data processing. The method comprises: capturing incremental change logs of multi-source heterogeneous business data, partitioned by business dimension and data type dimension, and synchronously updating them to a dynamic knowledge graph comprising an entity relation library and a text index library; in response to a query request, sequentially searching a multi-level cache system comprising a high-frequency dynamic cache layer and a low-frequency history cache layer, and triggering a semantic analysis flow on a cache miss; parsing the request with a fine-tuned pre-trained language model to generate a structured query instruction; executing a graph-neural-network-based retrieval-augmented generation operation and screening candidate evidence chains; inputting the candidate evidence chains into a generative model and generating a final query reply in combination with a dynamic desensitization strategy corresponding to the user's permissions; and asynchronously backfilling the multi-level cache system. The invention can improve the real-time performance of data queries, the accuracy of semantic understanding, and data security.

Inventors

LI YINGYING
Liu Fukuo
Zu Xia
ZHOU WENWEN
XIONG NENG
LIU JUNLIANG

Assignees

重庆数字资源集团有限公司

Dates

Publication Date: 2026-05-12
Application Date: 2025-12-22

Claims (10)

1. A data query method based on semantic recognition, the method comprising: capturing incremental change logs of multi-source heterogeneous business data through a real-time data pipeline, based on a dual-strategy partitioning mechanism over a business dimension and a data type dimension, and synchronously updating the incremental change logs to a dynamic knowledge graph comprising an entity relation library and a text index library by using a vector clock synchronization strategy, so as to construct a data foundation for semantic query; in response to a natural language query request initiated by a user, sequentially searching a high-frequency dynamic cache layer and a low-frequency history cache layer of a multi-level cache system comprising the high-frequency dynamic cache layer and the low-frequency history cache layer, directly returning a query result on a cache hit, and triggering a semantic analysis flow on a cache miss; performing multi-level semantic disambiguation and intent recognition on the natural language query request by using a fine-tuned pre-trained language model, and parsing the request to generate a structured query instruction containing entity objects, intent types, and constraint conditions; based on the structured query instruction, performing a graph-neural-network-based retrieval-augmented generation operation in the dynamic knowledge graph, and screening a candidate evidence chain according to evidence timeliness weights; and inputting the candidate evidence chain into a generative model, generating a final query reply in combination with a dynamic desensitization strategy corresponding to the user's permissions, and asynchronously backfilling the final query reply to the multi-level cache system according to an access frequency strategy.
2. The semantic recognition-based data query method of claim 1, wherein capturing incremental change logs of multi-source heterogeneous business data through a real-time data pipeline based on a dual-strategy partitioning mechanism over a business dimension and a data type dimension comprises: monitoring a binary log of a source database in real time through a database log connector, and capturing incremental operation data including insert operations, update operations, and delete operations; determining a base partition by a hash-modulo of the business identifier, applying a hash-modulo of the data type on top of it to perform secondary partitioning, and routing the incremental operation data to the corresponding message queue partition; counting the real-time throughput of each partition based on a sliding window, and, when the throughput of a specific partition is detected to exceed a hotspot threshold within a preset time window, triggering a dynamic partition migration strategy to migrate hotspot business data to an idle partition; and, during data transmission, checking the timestamps in the incremental operation data using a strategy that combines future-time filtering with a business whitelist, removing abnormal future-time data that exceeds a preset tolerance window while retaining the pre-examination batch flow data on the whitelist.
3. The semantic recognition-based data query method of claim 2, wherein synchronously updating the incremental change logs to a dynamic knowledge graph comprising an entity relation library and a text index library by using a vector clock synchronization strategy comprises: constructing a vector clock triplet comprising an engine type, an operation sequence number, and a timestamp, and using the vector clock triplet as the credential for data consistency verification; routing structured entity relation data in the incremental operation data to a graph database serving as the entity relation library, and routing unstructured text data in the incremental operation data to a search engine serving as the text index library; when a write operation is executed, automatically incrementing the operation sequence number of the corresponding engine and synchronously updating the current system timestamp; and, when a synchronization conflict between the entity relation library and the text index library is detected for the same data object, comparing the operation sequence numbers and timestamps of the vector clocks corresponding to the entity relation library and the text index library respectively, applying a last-write-wins strategy to force a unified data state, and retaining the operation trail for tracing.
4. The semantic recognition-based data query method according to claim 3, further comprising performing dynamic sharding by business division and relation extraction on the dynamic knowledge graph, specifically comprising: storing the entity relation library in shards according to business division codes, and, when a business division is added or merged, executing a zero-loss migration flow comprising pre-migration double writing, switchover verification, and backup cleanup; adopting an entity relation extraction network based on a pre-trained language model, strengthening business entity recognition through a masked language model task, and introducing adversarial samples to fine-tune a relation classifier; and extracting business entities and association relations from the unstructured text data by using the fine-tuned entity relation extraction network, and incrementally updating the extraction results to the entity relation library through a dual trigger mechanism combining time-driven and event-driven triggers.
5. The semantic recognition-based data query method of claim 1, wherein performing multi-level semantic disambiguation and intent recognition on the natural language query request by using a fine-tuned pre-trained language model comprises: parsing geographic location nouns in the natural language query request based on a geographic information system service, determining the business area range related to the query, and starting a fallback strategy if positioning fails; combining the user's historical interaction context with the user's profile features, and performing semantic completion and coreference resolution on ambiguous referring expressions in the natural language query request; inputting the disambiguated query text into an intent recognition model fine-tuned on the RoBERTa architecture, and recognizing the user's business query intent, wherein the intent recognition model is incrementally trained on a business corpus containing long-tail intents; and, if the confidence of the recognized intent is lower than a preset threshold, triggering similar-question recommendation logic and recording the low-confidence query for subsequent model optimization.
6. The semantic recognition-based data query method of claim 1, wherein performing a graph-neural-network-based retrieval-augmented generation operation in the dynamic knowledge graph based on the structured query instruction comprises: using a GraphSAGE neural network model to fuse entity attribute features and text sentence vector features in the dynamic knowledge graph and generate node embedding vectors; calculating the cosine similarity between the vector representation of the structured query instruction and the node embedding vectors, and retrieving the Top-K associated nodes whose similarity is higher than a preset threshold as an initial evidence set; constructing a multi-dimensional evidence evaluation system, and calculating a composite score for each piece of evidence in the initial evidence set according to the authority of the evidence source, the relevance of the content, and the timeliness of the release time; and introducing a timeliness weight formula, down-weighting or filtering evidence with time conflicts, and sorting and screening candidate evidence chains according to the composite scores, wherein the timeliness weight formula assigns a higher weight to more recently issued normative documents.
7. The semantic recognition-based data query method of claim 1, wherein inputting the candidate evidence chain into a generative model and generating a final query reply in combination with a dynamic desensitization strategy corresponding to the user's permissions comprises: embedding dynamic desensitization instructions in a prompt template of the generative model, wherein the dynamic desensitization instructions define processing rules for data of different sensitivity levels; identifying the permission level of the user issuing the current query, and matching the corresponding desensitization strategy according to the user permission level; applying high-strength desensitization to personal sensitive information by masking or truncation, and processing enterprise sensitive data by aggregate statistics or range display; and inputting the candidate evidence chain and the prompt embedded with the desensitization instructions into the generative model to generate a natural language reply text that meets compliance requirements.
8. The semantic recognition-based data query method according to claim 1, wherein sequentially searching the high-frequency dynamic cache layer and the low-frequency history cache layer of the multi-level cache system comprises: defining data whose access count exceeds a first threshold and whose update frequency is higher than a second threshold within a most recent preset time period as high-frequency dynamic data, storing the high-frequency dynamic data in the high-frequency dynamic cache layer based on an in-memory database, and setting a short time-to-live; defining data whose access count is lower than the first threshold or whose update frequency is lower than the second threshold within the most recent preset time period as low-frequency historical data, storing the low-frequency historical data in the low-frequency history cache layer based on a search engine, and adopting a daily update mechanism; when a user query request arrives, querying the high-frequency dynamic cache layer first, and querying the low-frequency history cache layer on a miss; and, if the target data is hit in the low-frequency history cache layer and the access characteristics of the target data meet a potential high-frequency condition, automatically promoting and backfilling the target data to the high-frequency dynamic cache layer.
9. The semantic recognition-based data query method of claim 8, further comprising dynamically adjusting the resources of the multi-level cache system: serializing and storing the cache data in a binary compression format; monitoring the memory utilization of the high-frequency dynamic cache layer in real time, and automatically triggering a capacity expansion flow when the memory utilization exceeds a preset alert threshold and stays above it for a preset time period; monitoring the shard query performance of the low-frequency history cache layer in real time, and automatically adjusting the number of shards and rebalancing the data distribution when the query latency of a single shard exceeds a preset performance threshold; and, based on analysis of historical query logs, adopting a hybrid eviction strategy that combines a least-recently-used algorithm with a least-frequently-used algorithm to clear expired or cold data from the high-frequency dynamic cache layer.
10. A data query system based on semantic recognition, the system comprising: a dynamic graph construction module, configured to capture incremental change logs of multi-source heterogeneous business data through a real-time data pipeline based on a dual-strategy partitioning mechanism over a business dimension and a data type dimension, and to synchronously update the incremental change logs to a dynamic knowledge graph comprising an entity relation library and a text index library by using a vector clock synchronization strategy; a multi-level cache retrieval module, configured to, in response to a natural language query request initiated by a user, sequentially search a high-frequency dynamic cache layer and a low-frequency history cache layer of a multi-level cache system comprising the high-frequency dynamic cache layer and the low-frequency history cache layer, directly return a query result on a cache hit, and trigger a semantic analysis flow on a cache miss; a deep semantic analysis module, configured to perform multi-level semantic disambiguation and intent recognition on the natural language query request by using a fine-tuned pre-trained language model, and to parse the request to generate a structured query instruction containing entity objects, intent types, and constraint conditions; a graph-augmented generation module, configured to perform, based on the structured query instruction, a graph-neural-network-based retrieval-augmented generation operation in the dynamic knowledge graph, and to screen candidate evidence chains according to evidence timeliness weights; and a secure reply processing module, configured to input the candidate evidence chain into a generative model, generate a final query reply in combination with a dynamic desensitization strategy corresponding to the user's permissions, and asynchronously backfill the final query reply to the multi-level cache system according to an access frequency strategy.
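The two-level routing and the timestamp check recited in claim 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the use of CRC32 as the hash, the way the two moduli are combined into one partition index, and the whitelist keyed by business identifier are all hypothetical choices. The sliding-window hotspot migration is not covered.

```python
import zlib


def stable_hash(value: str) -> int:
    """Deterministic hash (CRC32 is an assumed choice; the claims name no hash)."""
    return zlib.crc32(value.encode("utf-8"))


def route_partition(business_id: str, data_type: str,
                    base_partitions: int, sub_partitions: int) -> int:
    """Pick a base partition from the business identifier, then a secondary
    slot from the data type, and flatten both into one partition index."""
    base = stable_hash(business_id) % base_partitions
    sub = stable_hash(data_type) % sub_partitions
    return base * sub_partitions + sub


def accept_timestamp(event_ts: float, now: float, tolerance_s: float,
                     business_id: str, whitelist: set) -> bool:
    """Drop events stamped further in the future than the tolerance window,
    unless the business identifier is whitelisted (assumed whitelist key)."""
    if event_ts - now <= tolerance_s:
        return True
    return business_id in whitelist
```

Records of the same business always share a base partition, so per-business ordering can be preserved while different data types spread over the secondary slots.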
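The vector clock triplet and last-write-wins resolution of claim 3 might look like the sketch below. The class and function names are hypothetical. The claim states that sequence numbers and timestamps are compared but not in which order, so ordering by timestamp first and sequence number second is an assumption, as is preferring the first argument on an exact tie.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VectorClock:
    engine: str   # engine type, e.g. "graph" or "text" (assumed labels)
    seq: int      # per-engine operation sequence number
    ts: float     # timestamp of the last write, in seconds


def bump(clock: VectorClock, now: Optional[float] = None) -> VectorClock:
    """On a write: increment the engine's sequence number and refresh the timestamp."""
    stamp = time.time() if now is None else now
    return VectorClock(clock.engine, clock.seq + 1, stamp)


def resolve_lww(graph_clock: VectorClock, text_clock: VectorClock) -> VectorClock:
    """Last write wins: later timestamp first, then higher sequence number
    (assumed ordering). On an exact tie the first argument is returned."""
    return max((graph_clock, text_clock), key=lambda c: (c.ts, c.seq))
```

A caller would copy the winning side's state to the losing engine and log both clocks, which corresponds to the retained operation trail in the claim.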
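The retrieval and scoring steps of claim 6 can be illustrated with a small pure-Python sketch. The node embeddings are taken as given (the GraphSAGE model itself is out of scope here). The claims do not disclose the timeliness weight formula in this section, so the half-life decay and the linear weights below are assumptions chosen only to show the shape of the computation.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)


def top_k(query_vec, node_vecs: dict, k: int, threshold: float):
    """Return up to k (node_id, similarity) pairs above the threshold, best first."""
    scored = [(nid, cosine(query_vec, vec)) for nid, vec in node_vecs.items()]
    kept = [pair for pair in scored if pair[1] > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:k]


def timeliness_weight(age_days: float, half_life_days: float = 365.0) -> float:
    """Assumed decay: the weight halves every half_life_days."""
    return 0.5 ** (age_days / half_life_days)


def composite_score(authority: float, relevance: float, age_days: float,
                    weights=(0.3, 0.5, 0.2)) -> float:
    """Assumed linear blend of authority, relevance, and timeliness."""
    w_auth, w_rel, w_time = weights
    return w_auth * authority + w_rel * relevance + w_time * timeliness_weight(age_days)
```

With equal authority and relevance, a newly issued document scores higher than an older one, which matches the claim's requirement that recent normative documents receive more weight.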
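The lookup order and promotion rule of claim 8 can be sketched with two in-process dictionaries standing in for the in-memory database and the search engine. The class name, the hit-count promotion rule, and the default values are hypothetical; the claim only requires that data meeting a "potential high-frequency condition" be promoted.

```python
import time


class TwoTierCache:
    """Hot layer with a short TTL in front of a cold layer without expiry."""

    def __init__(self, hot_ttl_s: float = 60.0, promote_after: int = 3,
                 clock=time.monotonic):
        self.hot = {}        # key -> (value, expires_at)
        self.cold = {}       # key -> value
        self.cold_hits = {}  # key -> number of cold-layer hits so far
        self.hot_ttl_s = hot_ttl_s
        self.promote_after = promote_after
        self.clock = clock

    def get(self, key):
        """Return (value, layer) where layer is 'hot', 'cold', or 'miss'."""
        now = self.clock()
        entry = self.hot.get(key)
        if entry is not None:
            value, expires_at = entry
            if now < expires_at:
                return value, "hot"
            del self.hot[key]  # expired: fall through to the cold layer
        if key in self.cold:
            self.cold_hits[key] = self.cold_hits.get(key, 0) + 1
            value = self.cold[key]
            # Assumed promotion rule: enough cold hits marks the key as hot.
            if self.cold_hits[key] >= self.promote_after:
                self.hot[key] = (value, now + self.hot_ttl_s)
            return value, "cold"
        return None, "miss"

    def backfill(self, key, value, high_frequency: bool) -> None:
        """Store a freshly generated reply in the layer chosen by the caller."""
        self.cold[key] = value
        if high_frequency:
            self.hot[key] = (value, self.clock() + self.hot_ttl_s)
```

A "miss" result is the point at which the semantic analysis flow of claim 1 would be triggered, with `backfill` called once the reply has been generated.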
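The masking and range-display rules of claim 7 can be illustrated on structured fields. Note the difference from the claim: the claim embeds desensitization instructions in the prompt template of the generative model, whereas this sketch applies the rules directly to field values before they would reach a prompt. The permission level names, field names, and policy table are all hypothetical.

```python
def mask_middle(text: str, head: int = 3, tail: int = 4) -> str:
    """Keep the first `head` and last `tail` characters and star out the rest."""
    if len(text) <= head + tail:
        return "*" * len(text)
    hidden = len(text) - head - tail
    return text[:head] + "*" * hidden + text[len(text) - tail:]


def to_interval(value: int, step: int) -> str:
    """Replace an exact figure with the half-open range [low, low + step) it falls in."""
    low = (value // step) * step
    return f"{low}-{low + step}"


# Hypothetical policy table: permission level -> field -> transformation.
POLICY = {
    "admin": {},
    "staff": {"phone": mask_middle},
    "public": {"phone": mask_middle,
               "revenue": lambda v: to_interval(v, 1_000_000)},
}


def apply_policy(record: dict, level: str) -> dict:
    """Return a desensitized copy; unknown levels get the strictest policy."""
    rules = POLICY.get(level, POLICY["public"])
    return {field: rules[field](value) if field in rules else value
            for field, value in record.items()}
```

Defaulting unknown levels to the strictest policy is a deliberate choice in this sketch, so that a misconfigured permission level cannot expose raw values.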

Description

Data query method and system based on semantic recognition

Technical Field

The invention relates to the technical field of data processing, and in particular to a data query method and system based on semantic recognition.

Background

With the advance of digital transformation, large management institutions, enterprises, and public institutions have accumulated massive multi-source heterogeneous data, covering forms such as structured business registration information, circulation records, unstructured normative documents, and business guides. To manage and query this data, the current mainstream technical solution generally adopts a layered "database + search engine" architecture. At the data processing level, it relies mainly on extract, transform, load (ETL) batch jobs triggered on a daily schedule to synchronize change data from the business systems to the query library, so the data update cycle is typically on the order of days or hours. At the query interaction level, the system usually provides a fixed application programming interface (API) or a keyword-matching search interface, and the user must enter specific fields or predefined terms exactly to obtain information. However, this conventional architecture exposes several technical bottlenecks when faced with increasingly complex real-time query requirements. First, the batch-based data synchronization mechanism causes data to lag seriously, so real-time business queries during peak periods or burst scenarios cannot be supported; moreover, multi-source heterogeneous data lacks an efficient consistency verification mechanism during fusion, so version conflicts arise easily.
Secondly, existing question-answering and query engines lack deep semantic understanding; they rely on static rule bases or shallow text matching and cannot accurately parse ambiguous references, multiple intents, and cross-domain logic in natural language, which leads to inaccurate query results or answers that do not address the question. In addition, under high-concurrency access, the system's caching strategy is simplistic and lacks intelligent tiered scheduling of hot and cold data, which easily causes resource congestion and response timeouts. Finally, regarding data security and compliance, existing schemes mostly use static data display rules, making it difficult to adjust the desensitization strength of sensitive information dynamically according to the permission levels of different users; the result is either a risk of data leakage or data that is unusable because of excessive control. In summary, because they depend too heavily on batch processing and keyword matching, existing business data query methods cannot meet real-time business query requirements accurately and efficiently, and their security control is not flexible enough. Therefore, an intelligent data query solution that integrates real-time data processing, deep semantic understanding, and dynamic security control is needed.

Disclosure of Invention

The invention provides a data query method and system based on semantic recognition, which solve the problems of poor timeliness, weak semantic understanding, slow concurrent response, and rigid security control in the prior art.
To achieve the above purpose, the embodiments of the present invention adopt the following technical solutions: In a first aspect, an embodiment of the present invention provides a data query method based on semantic recognition, the method comprising: capturing incremental change logs of multi-source heterogeneous business data through a real-time data pipeline, based on a dual-strategy partitioning mechanism over a business dimension and a data type dimension, and synchronously updating the incremental change logs to a dynamic knowledge graph comprising an entity relation library and a text index library by using a vector clock synchronization strategy, so as to construct a data foundation for semantic query; in response to a natural language query request initiated by a user, sequentially searching a high-frequency dynamic cache layer and a low-frequency history cache layer of a multi-level cache system comprising the high-frequency dynamic cache layer and the low-frequency history cache layer, directly returning a query result on a cache hit, and triggering a semantic analysis flow on a cache miss; performing multi-level semantic disambiguation and intent recognition on the natural language query request by using a fine-tuned pre-trained language model, and parsing the request to generate a structured query instruction containing entity objects, intent types, and constraint conditions; based on the structured query instruction, performing a retrieval-augmented generation operation based