
CN-121858571-B - Intelligent cache processing method for marking data flow in real time

CN121858571B, CN 121858571 B, CN121858571 B, CN 121858571B, CN-121858571-B

Abstract

The invention discloses an intelligent cache processing method for real-time annotation data streams, belonging to the technical field of cross-domain data processing. The method comprises: receiving a multi-source annotation data stream; mapping annotation fields with a temporal graph neural network and calculating their similarity to generate field mapping rules; performing incremental cleaning and outputting a structured annotation data stream; extracting data entities, attributes and relations; constructing a heterogeneous cache graph and establishing a quaternary composite index; constructing a hot-warm-cold tiered cache architecture and dynamically allocating data entities to tiers based on real-time heat entropy, relation strength and other factors, while synchronously updating the graph and the index; identifying low-confidence data through entity confidence evaluation and generating correction suggestions that are fed back to the cleaning process; synchronizing the sub-graph copy parameters and hot-entity states of each data source by means of a secure aggregation algorithm; and realizing fine-grained dynamic access control based on a deep policy evaluation network.

Inventors

  • Dai Keqing
  • TIAN WEI
  • WEI YANCHUN
  • LI JIAZHAO
  • LI JIAMING
  • ZHANG QIANG
  • ZUO HANG

Assignees

  • 吉林云投莱森购数字科技有限公司

Dates

Publication Date
2026-05-08
Application Date
2026-03-11

Claims (10)

  1. An intelligent cache processing method for a real-time annotation data stream, characterized by comprising the following steps: S1, receiving a multi-source annotation data stream, mapping annotation fields from different sources to nodes of a temporal graph neural network, calculating the semantic and structural similarity among the nodes, generating field mapping rules, performing incremental cleaning, and outputting a structured annotation data stream; S2, taking the structured annotation data stream as input, extracting the data entities, entity attributes and inter-entity relations it contains, constructing a heterogeneous cache graph, and establishing a multi-dimensional index; S3, constructing, based on the heterogeneous cache graph, a tiered cache architecture comprising a hot layer, a warm layer and a cold layer, dynamically allocating the data entities in the heterogeneous cache graph to the corresponding cache layers, and synchronously updating the state of the heterogeneous cache graph and the multi-dimensional index in real time; S4, identifying annotation data based on the association and confidence characteristics of each data entity in the heterogeneous cache graph, generating data correction suggestions, and feeding them back to the incremental cleaning process so as to update the structured annotation data stream and the heterogeneous cache graph, while locally maintaining a corresponding sub-graph copy of the heterogeneous cache graph at each data source node and synchronizing the graph neural network model parameters and hot-spot entity states of the sub-graph copies through a secure aggregation algorithm; and S5, performing dynamic access control on the data stored in the tiered cache architecture based on the attributes and association characteristics of the data entities in the heterogeneous cache graph, and feeding the control decisions back to S3.
  2. The intelligent cache processing method for a real-time annotation data stream according to claim 1, wherein S1 specifically comprises: S1.1, creating a temporal feature node for each annotation field of each source, wherein the attributes of the temporal feature node comprise a semantic embedding vector of the field name, a statistical feature vector of the value-range distribution, and a timestamp sequence of the field's occurrence and change events; S1.2, inputting all temporal feature nodes within a continuous time window into the temporal graph neural network, wherein a temporal gated recurrent unit layer captures the temporal dependency features of each temporal feature node, and a graph attention layer then aggregates the cross-source structural features among the temporal feature nodes of different data sources, so as to output a fused feature vector for each temporal feature node; S1.3, calculating the cross-source semantic affinity between the fused feature vectors of any two temporal feature nodes, calculating a temporal co-occurrence strength based on the co-occurrence of the two temporal feature nodes within a sliding time window, and inputting the cross-source semantic affinity and the temporal co-occurrence strength into a rule generation neural network, which outputs a probabilistic field mapping rule matrix, wherein each element of the field mapping rule matrix represents the confidence weight of mapping one source field to another source field; S1.4, parsing and aligning the original annotation fields arriving in real time according to the field mapping rule matrix, directly fusing data whose mapping confidence is higher than a preset first threshold, flagging data whose mapping confidence is lower than the preset first threshold but higher than a preset second threshold and temporarily storing it in a pending area, triggering an anomaly alarm for data whose mapping confidence is lower than the preset second threshold, and finally outputting the structured annotation data stream.
  3. The intelligent cache processing method for a real-time annotation data stream according to claim 1, wherein the construction of the heterogeneous cache graph in S2 comprises: S2.1, identifying data entities from the structured annotation data stream and assigning each data entity a globally unique entity fingerprint code, wherein the entity fingerprint code is generated by combining an entity type hash value and an attribute hash value; S2.2, extracting the attribute key-value pairs of each data entity, instantiating each attribute key-value pair as an attribute node, and linking each attribute node to the data entity to which it belongs through a link edge; S2.3, identifying the association relations among different data entities, instantiating each relation as a directed relation edge, and assigning the relation edge a relation type, a creation time, and a relation strength value based on co-occurrence frequency and semantic consistency; S2.4, constructing the heterogeneous cache graph from all identified data entities, all generated attribute nodes, all link edges and all generated relation edges, wherein the heterogeneous cache graph takes the data entities as vertices, the attribute nodes as auxiliary vertices, and the link edges between data entities and attribute nodes together with the relation edges between data entities as edges.
  4. The intelligent cache processing method for a real-time annotation data stream according to claim 3, wherein the multi-dimensional index in S2 is a quaternary composite index structure comprising an attribute inverted index, a relation adjacency index, a temporal heat index and a sub-graph structure fingerprint index; the attribute inverted index is an inverted list built over the keys and values of all attribute nodes and is used for fuzzy matching and range queries on attribute values; the relation adjacency index records, for each data entity, the relation type, target data entity and current relation strength value of all its relation edges, sorted in descending order of relation strength value; the temporal heat index dynamically calculates the real-time heat entropy of each data entity by applying a time decay function to the timestamps of the events in which the data entity is accessed, associated, or has an attribute modified, and maintains a sorted index of the real-time heat entropy; and the sub-graph structure fingerprint index computes, with each data entity as the center, the local sub-graph structure within two hops and generates a fixed-length sub-graph structure fingerprint for similarity retrieval based on graph structure.
  5. The intelligent cache processing method for a real-time annotation data stream according to claim 4, wherein S3 specifically comprises: S3.1, dividing the cache storage medium into three logical layers, namely a hot layer, a warm layer and a cold layer; S3.2, dynamically allocating, based on the heterogeneous cache graph, the data entities in the heterogeneous cache graph to the divided cache layers through a cache position decision function, wherein the cache position decision function takes the real-time heat entropy of the target data entity, the average relation strength value of its relation edges, and the storage overhead of the target data entity as input parameters and calculates a composite score; S3.3, when any data entity is migrated to the hot layer according to the decision, retrieving, based on the relation adjacency index, the associated data entities that have a relation edge with that hot-layer data entity and are currently located in the hot layer or the cold layer, and preloading them into a staging buffer of the hot layer; and S3.4, after data entities migrate among the cache layers according to S3.2-S3.3, synchronously updating the storage position state of the corresponding data entities in the heterogeneous cache graph and updating the entries associated with them in the multi-dimensional index.
  6. The intelligent cache processing method for a real-time annotation data stream according to claim 5, wherein S3 further comprises an access-driven graph heat update mechanism, specifically comprising: S3.5, when a data entity in the cache is successfully accessed, updating the real-time heat entropy of that data entity according to the timestamp of the access event; S3.6, triggering associated heat propagation according to the operation type of the access request, comprising: if the operation type is reading an attribute of the data entity, propagating the real-time heat entropy of the data entity to the attribute nodes directly connected to it; if the operation type is traversing a relation edge between the data entity and another data entity, propagating the real-time heat entropy of the data entity, with attenuation, to the associated data entity along the traversed relation edge; S3.7, writing the heat changes produced in S3.5 and S3.6 into the temporal heat index in real time; S3.8, re-evaluating the cache position of the data entity based on the updated temporal heat index, and triggering a migration of the data entity among the hot layer, the warm layer and the cold layer if the change in the real-time heat entropy of any data entity causes the composite score of the cache position decision function to cross a predefined tier threshold.
  7. The intelligent cache processing method for a real-time annotation data stream according to claim 1, wherein identifying annotation data based on the association and confidence characteristics of each data entity in the heterogeneous cache graph, generating data correction suggestions and feeding them back to the incremental cleaning process to update the structured annotation data stream and the heterogeneous cache graph comprises: S4.1, maintaining a confidence track for each attribute node in the heterogeneous cache graph, wherein the confidence track records the attribute node identifier, historical values, value sources, occurrence frequency, last update timestamp and value confidence; S4.2, evaluating each data entity, based on the heterogeneous cache graph, with a predefined entity comprehensive confidence evaluation model, wherein the model aggregates the consistency of the confidence tracks of all attribute nodes of the target data entity, the source diversity of its relation edges, and the stability of the data entity across historical corrections, and outputs an entity confidence coefficient between 0 and 1; S4.3, periodically scanning the heterogeneous cache graph and, when the entity confidence coefficient of any data entity is lower than a preset confidence alarm threshold, marking that data entity as a low-confidence entity and automatically generating a traceability correction proposal; and S4.4, packaging the generated traceability correction proposal as a correction event and feeding it back to the incremental cleaning process, wherein the incremental cleaning process updates the structured annotation data stream according to the correction event, thereby triggering an update of the heterogeneous cache graph.
  8. The intelligent cache processing method for a real-time annotation data stream according to claim 1, wherein, in S4, locally maintaining a corresponding sub-graph copy of the heterogeneous cache graph at each data source node and synchronizing the graph neural network model parameters and hot-spot entity states of the sub-graph copies through a secure aggregation algorithm comprises: S4.5, each participating data source node cutting out, according to its data jurisdiction, a corresponding sub-graph copy from the global dynamic entity association graph for maintenance, wherein the sub-graph copy is a slice of the heterogeneous cache graph; S4.6, setting a synchronization period and, in each period, each data source node performing the following: training the temporal graph neural network with local data and homomorphically encrypting the model parameter gradients computed during training to form encrypted gradient data; extracting the top K data entities with the highest real-time heat entropy in the sub-graph copy, obtaining their entity confidence coefficients and attributes to form a local hot-spot entity state snapshot, and encrypting the snapshot; and uploading the encrypted gradient data and the encrypted local hot-spot entity state snapshot to a central coordinator.
  9. The intelligent cache processing method for a real-time annotation data stream according to claim 8, wherein, in S4, locally maintaining a corresponding sub-graph copy of the heterogeneous cache graph at each data source node and synchronizing the graph neural network model parameters and hot-spot entity states of the sub-graph copies through a secure aggregation algorithm further comprises: S4.7, after receiving the encrypted information uploaded by all data source nodes, the central coordinator executing the secure aggregation algorithm, which comprises: performing a secure average over the encrypted gradient data from all data source nodes, decrypting the result, and using the decrypted result to update the graph neural network model, the graph neural network model being the temporal graph neural network; merging the encrypted local hot-spot entity state snapshots from all data source nodes and analyzing them to obtain the global hot spots and the confidence distribution; and sending the updated graph neural network model parameters to all data source nodes; and S4.8, each data source node receiving the graph neural network model parameters issued by the central coordinator, updating its local temporal graph neural network, and adjusting the states of the related data entities in its sub-graph copy.
  10. The intelligent cache processing method for a real-time annotation data stream according to claim 1, wherein S5 specifically comprises: S5.1, when an access request for a target data entity is received, extracting the subject identity, the operation type and an environment token from the context of the access request; S5.2, querying the heterogeneous cache graph to obtain all attribute nodes of the target data entity, its associated data entities and their relation types, and dynamically generating a context feature vector for the access in combination with the historical access records of the subject identity; S5.3, inputting the generated context feature vector into a deep policy evaluation network, which outputs an arbitration decision, wherein the arbitration decision is one of allow, deny or degrade, and the degrade decision means that only a desensitized, simplified version of the target data entity stored in the cold layer may be accessed; S5.4, executing the arbitration decision, comprising: if the arbitration decision is allow, granting the access right; if the arbitration decision is degrade or deny, first recording the access event and then sending a cache degradation instruction for the target data entity to the tiered cache architecture, wherein the cache degradation instruction forces the real-time heat entropy of the target data entity to zero whenever it is updated.
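
The enforcement step S5.4 of claim 10 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the policy evaluation network is not modelled (its decision is simply passed in), and the names Decision, CacheState, enforce and the return labels "full" and "desensitized" are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    """The three arbitration outcomes named in claim 10."""
    ALLOW = "allow"
    DENY = "deny"
    DEGRADE = "degrade"


@dataclass
class CacheState:
    """Hypothetical stand-in for the tiered cache's bookkeeping."""
    heat: dict = field(default_factory=dict)       # entity id -> heat value
    degraded: set = field(default_factory=set)     # entities under a degradation instruction
    audit_log: list = field(default_factory=list)  # recorded access events

    def update_heat(self, entity_id, new_heat):
        # A degraded entity has its heat forced to zero on every update.
        self.heat[entity_id] = 0.0 if entity_id in self.degraded else new_heat


def enforce(decision, subject, entity_id, cache):
    """Apply an arbitration decision; returns the granted access level or None."""
    if decision is Decision.ALLOW:
        return "full"
    # Degrade or deny: record the access event first, then issue the
    # cache degradation instruction for the target entity.
    cache.audit_log.append((subject, entity_id, decision.value))
    cache.degraded.add(entity_id)
    cache.update_heat(entity_id, cache.heat.get(entity_id, 0.0))
    # Degrade still serves a desensitized version; deny serves nothing.
    return "desensitized" if decision is Decision.DEGRADE else None
```

Because the heat of a degraded entity stays at zero, any heat-based placement score for it can only fall, so the entity would drift toward the cold layer on the next re-evaluation.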

Description

Intelligent cache processing method for marking data flow in real time

Technical Field

The invention belongs to the technical field of cross-domain data processing, and particularly relates to an intelligent cache processing method for real-time annotation data streams.

Background

With the wide application of data annotation technology in fields such as artificial intelligence and big data analysis, real-time annotation data streams have become multi-source, heterogeneous, dynamic and high-frequency, which poses serious challenges for data processing and cache management. In the prior art, multi-source annotation fields lack a precise semantic and structural mapping mechanism, so field conflicts and inconsistent formats arise easily, data structuring is inefficient, and real-time processing requirements are difficult to meet. Traditional cache architectures mostly adopt single-level or static allocation strategies and do not fully consider the association relations among data entities or their real-time heat changes, which wastes cache resources and leads to high access latency. Meanwhile, existing schemes lack a mechanism for dynamically evaluating and correcting the confidence of annotation data, so low-confidence data easily flows into downstream processes and undermines the reliability of data applications. In addition, in distributed data source scenarios, the data synchronization process carries a risk of privacy leakage, and access control is mostly coarse-grained: permissions cannot be adjusted dynamically according to data attributes, subject identity and the like, so it is difficult to balance data security and access flexibility. Together, these problems limit the processing efficiency, data quality and security compliance of real-time annotation data streams, and an efficient, intelligent cache processing scheme is needed to solve them.

Disclosure of Invention

To address the deficiencies of the prior art, the invention provides an intelligent cache processing method for real-time annotation data streams. The method receives a multi-source annotation data stream; maps fields with a temporal graph neural network and calculates their similarity to generate mapping rules; performs incremental cleaning and outputs a structured annotation data stream; extracts data entities, attributes and relations; constructs a heterogeneous cache graph and builds a quaternary composite index; then constructs a hot-warm-cold tiered cache architecture, dynamically allocates data entities based on real-time heat entropy, relation strength and other factors, and synchronously updates the graph and the index; evaluates entity confidence to identify low-confidence data and generates correction suggestions that are fed back to the cleaning process; synchronizes the sub-graph copy parameters and hot-entity states of each data source through a secure aggregation algorithm; and realizes fine-grained dynamic access control based on a deep policy evaluation network. The method improves the cache efficiency, data reliability and access security of multi-source annotation data and is suitable for real-time data stream processing scenarios.
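
As an illustration of the tier allocation summarized above, the following Python sketch computes a time-decayed heat value and a composite placement score. The claims name a time decay function and a cache position decision function but this text gives no formulas for them, so the exponential half-life decay, the linear weighting, the weights and the thresholds below are all assumptions chosen for illustration, as are the names Entity, decayed_heat, placement_score and choose_tier.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Entity:
    fingerprint: str
    # Timestamps (seconds) of access / association / attribute-change events.
    event_times: list = field(default_factory=list)
    # Strength values of the entity's relation edges.
    relation_strengths: list = field(default_factory=list)
    # Storage overhead, assumed to be normalised to [0, 1].
    storage_cost: float = 0.0


def decayed_heat(event_times, now, half_life=60.0):
    """Sum of event weights, each halving every `half_life` seconds (assumed form)."""
    rate = math.log(2) / half_life
    return sum(math.exp(-rate * (now - t)) for t in event_times if t <= now)


def placement_score(entity, now, w_heat=0.6, w_rel=0.3, w_cost=0.1):
    """Assumed linear score: heat and relation strength raise it, storage cost lowers it."""
    heat = decayed_heat(entity.event_times, now)
    strengths = entity.relation_strengths
    avg_rel = sum(strengths) / len(strengths) if strengths else 0.0
    return w_heat * heat + w_rel * avg_rel - w_cost * entity.storage_cost


def choose_tier(score, hot_threshold=1.0, warm_threshold=0.3):
    """Map a composite score to a cache layer using assumed thresholds."""
    if score >= hot_threshold:
        return "hot"
    if score >= warm_threshold:
        return "warm"
    return "cold"
```

Re-running placement_score after each heat update and comparing the resulting tier with the current one is one simple way to detect that an entity has crossed a tier threshold and should migrate.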
In order to achieve the above purpose, the present invention provides the following technical solution: An intelligent cache processing method for a real-time annotation data stream comprises the following steps: S1, receiving a multi-source annotation data stream, mapping annotation fields from different sources to nodes of a temporal graph neural network, calculating the semantic and structural similarity among the nodes, generating field mapping rules, performing incremental cleaning, and outputting a structured annotation data stream; S2, taking the structured annotation data stream as input, extracting the data entities, entity attributes and inter-entity relations it contains, constructing a heterogeneous cache graph, and establishing a multi-dimensional index; S3, constructing, based on the heterogeneous cache graph, a tiered cache architecture comprising a hot layer, a warm layer and a cold layer, dynamically allocating the data entities in the heterogeneous cache graph to the corresponding cache layers, and synchronously updating the state of the heterogeneous cache graph and the multi-dimensional index in real time; S4, identifying annotation data based on the association and confidence characteristics of each data entity in the heterogeneous cache graph, generating data correction suggestions, and feeding them back to the incremental cleaning process so as to update the structured annotation data stream and the heterogeneous cache graph, while locally maintaining a corresponding sub-graph copy of the heterogeneous cache graph at each data source node, and synchronizing graph neural net