CN-121278004-B - Storage strategy optimization method and device based on data heat and data lineage


Abstract

The embodiments of the application provide a storage strategy optimization method and device based on data heat and data lineage. The method dynamically optimizes the storage strategy by designing a multidimensional heat calculation model and analyzing node attributes with a graph attention network. It constructs a directed graph structure and combines it with lineage analysis to establish a reliable dependency management system. It also introduces a hierarchical storage strategy that ensures storage efficiency through heat-based partitioning and dependency optimization. The method addresses the shortcomings of conventional techniques in heat evaluation, lineage analysis, and storage optimization, and provides technical support for storage management in software systems.

Inventors

SU HUIMIN
LI QIAN
WANG ZHIHAO
GUO HONGYU
JIANG XINYU

Assignees

中国电子科技集团公司第十五研究所

Dates

Publication Date: 20260512
Application Date: 20251010

Claims (8)

1. A storage strategy optimization method based on data heat and data lineage, the method comprising: collecting data packet size, communication frequency, data type, and timestamp information of nodes in a software system, counting the access frequency, access time, and user behavior of the nodes, calculating the access frequency heat, time decay heat, and user behavior heat of the nodes, and combining the access frequency heat, the time decay heat, and the user behavior heat to generate a node heat value; constructing a directed graph with the nodes as vertices and the dependency relationships among the nodes as edges, taking the node heat value as a node attribute and the computing resource consumption of the nodes as the edge weight, and iteratively training on the directed graph using a graph attention network to generate a dynamic data lineage graph, which comprises: collecting the input and output data flows, data processing tasks, and communication relationships among the nodes in the software system, constructing a node dependency matrix based on the data flows and the communication relationships, converting the node dependency matrix into an adjacency matrix, establishing directed connections among the nodes, and adding the node heat value as an attribute feature of the nodes; monitoring the CPU occupation time, memory usage, and number of I/O operations during data transmission between nodes, combining the CPU occupation time, the memory usage, and the number of I/O operations by weighting to calculate a resource consumption coefficient between the nodes, and taking the resource consumption coefficient as the weight of the directed edge to generate a weighted dependency graph; inputting the node attribute features and the edge weights into the graph attention network, calculating attention coefficients between the nodes, performing weighted aggregation of the features of adjacent nodes according to the attention coefficients, generating characterization vectors of the nodes through a nonlinear transformation, constructing multi-head attention layers for the nodes based on the characterization vectors, concatenating and normalizing the outputs of the multi-head attention layers, iteratively updating the normalized node characterizations, calculating the loss function value of the node characterizations, adjusting the parameters of the graph attention network based on the loss function value, repeating the parameter adjustment until the loss function value converges, applying the trained graph attention network to the dynamic updating of the node characterizations, and outputting the dynamic data lineage graph; and dividing the nodes into cold nodes, warm nodes, and hot nodes based on the heat values of the nodes in the dynamic data lineage graph, applying a priority caching strategy to the hot nodes, compressing and storing the cold nodes, retaining the direct dependencies of the cold nodes on the hot nodes and the warm nodes, deleting the indirect dependencies of the cold nodes, and dynamically updating the node characterization vectors of the graph attention network based on the heat values of the nodes.
2. The storage strategy optimization method based on data heat and data lineage according to claim 1, wherein collecting the data packet size, communication frequency, data type, and timestamp information of nodes in the software system, counting the access frequency, access time, and user behavior of the nodes, and calculating the access frequency heat, time decay heat, and user behavior heat of the nodes comprises: reading the running log of the software system, extracting node identifiers, data packet sizes, communication frequencies, data types, and timestamp information from the running log, establishing a node data index matrix, dividing the node data index matrix according to a preset time window, generating a multidimensional data feature sequence, and normalizing the multidimensional data feature sequence; and counting the access frequency, access time, and user dwell time of the nodes within the time window, dividing the access frequency by the maximum access frequency to obtain the access frequency heat, calculating the difference between the current time and the most recent access time and obtaining the time decay heat from that difference using a decay function, dividing the weighted sum of the user dwell time and the number of clicks by the maximum weighted sum to obtain the user behavior heat, and linearly combining the access frequency heat, the time decay heat, and the user behavior heat to obtain a comprehensive heat value.
3. The method according to claim 1, wherein combining the access frequency heat, the time decay heat, and the user behavior heat to generate a node heat value comprises: calculating the historical heat value distribution of the nodes over a plurality of time windows, determining weight coefficients for the access frequency heat, the time decay heat, and the user behavior heat based on the historical heat value distribution, multiplying each weight coefficient by the corresponding heat and summing the products to obtain the initial heat values of the nodes, and constructing a node heat evaluation matrix; and performing eigendecomposition on the node heat evaluation matrix, extracting the principal component eigenvectors, constructing a heat mapping function based on the principal component eigenvectors, inputting the initial heat values into the heat mapping function for normalization to generate the final heat values of the nodes, and taking the final heat value as the heat attribute of each node.
4. The method according to claim 1, wherein dividing the nodes into cold nodes, warm nodes, and hot nodes based on the heat values of the nodes in the dynamic data lineage graph comprises: extracting the heat values of the nodes from the dynamic data lineage graph, establishing a probability distribution model of the heat values, calculating the mean and standard deviation of the heat values, setting a heat threshold interval based on the mean and standard deviation, marking nodes whose heat value is above the upper threshold as hot nodes, marking nodes whose heat value lies between the lower threshold and the upper threshold as warm nodes, and marking nodes whose heat value is below the lower threshold as cold nodes; and constructing a subgraph for each of the hot nodes, the warm nodes, and the cold nodes, calculating connectivity indexes and node importance scores for the subgraphs, establishing node state transition rules based on the connectivity indexes and the importance scores, dynamically adjusting the hot, warm, and cold marks of the nodes according to the state transition rules, and updating the hierarchical structure of the data lineage graph.
5. The storage strategy optimization method based on data heat and data lineage according to claim 1, wherein applying a priority caching strategy to the hot nodes, compressing and storing the cold nodes, retaining the direct dependencies of the cold nodes on the hot nodes and the warm nodes, deleting the indirect dependencies of the cold nodes, and dynamically updating the node characterization vectors of the graph attention network based on the heat values of the nodes comprises: constructing a multi-level cache structure, storing the hot node data in the cache, encoding the cold node data with a lossless compression algorithm, transferring the compressed cold node data to a low-speed storage device, scanning the dependencies of the cold nodes, retaining the first-order adjacency relationships between the cold nodes and the hot nodes and warm nodes, removing the connecting edges between cold nodes, and updating the adjacency matrix of the dependency graph; and acquiring the real-time heat value of each node, adjusting the weight coefficient of the node in the graph attention network based on the real-time heat value, recalculating the attention distribution of the node, combining the attention distribution with the historical characterization vector of the node by weighting to generate an updated characterization vector of the node, and inputting the updated characterization vector into the graph attention network for online learning.
6. A storage strategy optimization device based on data heat and data lineage, the device comprising: a node heat determining module, configured to collect data packet size, communication frequency, data type, and timestamp information of nodes in a software system, count the access frequency, access time, and user behavior of the nodes, calculate the access frequency heat, time decay heat, and user behavior heat of the nodes, and combine the access frequency heat, the time decay heat, and the user behavior heat to generate a node heat value; a lineage graph construction module, configured to construct a directed graph with the nodes as vertices and the dependency relationships among the nodes as edges, take the node heat value as a node attribute and the computing resource consumption of the nodes as the edge weight, and iteratively train on the directed graph using a graph attention network to generate a dynamic data lineage graph, which comprises: collecting the input and output data flows, data processing tasks, and communication relationships among the nodes in the software system, constructing a node dependency matrix based on the data flows and the communication relationships, converting the node dependency matrix into an adjacency matrix, establishing directed connections among the nodes, and adding the node heat value as an attribute feature of the nodes; monitoring the CPU occupation time, memory usage, and number of I/O operations during data transmission between nodes, combining the CPU occupation time, the memory usage, and the number of I/O operations by weighting to calculate a resource consumption coefficient between the nodes, and taking the resource consumption coefficient as the weight of the directed edge to generate a weighted dependency graph; inputting the node attribute features and the edge weights into the graph attention network, calculating attention coefficients between the nodes, performing weighted aggregation of the features of adjacent nodes according to the attention coefficients, generating characterization vectors of the nodes through a nonlinear transformation, constructing multi-head attention layers for the nodes based on the characterization vectors, concatenating and normalizing the outputs of the multi-head attention layers, iteratively updating the normalized node characterizations, calculating the loss function value of the node characterizations, adjusting the parameters of the graph attention network based on the loss function value, repeating the parameter adjustment until the loss function value converges, applying the trained graph attention network to the dynamic updating of the node characterizations, and outputting the dynamic data lineage graph; and a strategy storage module, configured to divide the nodes into cold nodes, warm nodes, and hot nodes based on the heat values of the nodes in the dynamic data lineage graph, apply a priority caching strategy to the hot nodes, compress and store the cold nodes, retain the direct dependencies of the cold nodes on the hot nodes and the warm nodes, delete the indirect dependencies of the cold nodes, and dynamically update the node characterization vectors of the graph attention network based on the heat values of the nodes.
7. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the storage strategy optimization method based on data heat and data lineage according to any one of claims 1 to 5.
8. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the storage strategy optimization method based on data heat and data lineage according to any one of claims 1 to 5.
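
Illustrative sketch (not part of the claims): the heat calculation of claim 2, the threshold-based tiering of claim 4, and the removal of edges between cold nodes from claim 5 can be sketched as follows. The claims fix neither the decay function nor any numeric parameter, so everything concrete here is an assumption made for illustration: the exponential decay, the weights, the threshold multiplier k, and the function and field names (node_heat, classify, prune_cold_edges, freq, last_access, dwell, clicks) are all hypothetical. The graph attention network is omitted entirely.

```python
import math
from statistics import mean, pstdev


def node_heat(stats, now, decay_rate=0.1, weights=(0.4, 0.3, 0.3),
              dwell_w=0.5, click_w=0.5):
    """Combine three heat components into one value per node.

    stats maps node id -> {"freq", "last_access", "dwell", "clicks"}.
    decay_rate, weights, dwell_w and click_w are assumed values; the
    claims only call for "a decay function" and a linear combination.
    """
    max_freq = max(s["freq"] for s in stats.values()) or 1
    # Weighted sum of dwell time and click count per node.
    behavior = {n: dwell_w * s["dwell"] + click_w * s["clicks"]
                for n, s in stats.items()}
    max_behavior = max(behavior.values()) or 1
    w_freq, w_decay, w_user = weights
    heat = {}
    for n, s in stats.items():
        freq_heat = s["freq"] / max_freq                       # frequency / max
        decay_heat = math.exp(-decay_rate * (now - s["last_access"]))
        user_heat = behavior[n] / max_behavior                 # weighted sum / max
        heat[n] = w_freq * freq_heat + w_decay * decay_heat + w_user * user_heat
    return heat


def classify(heat, k=0.5):
    """Label nodes hot/warm/cold using mean +/- k * standard deviation.

    k is an assumed multiplier; the claims only say the thresholds are
    set from the mean and the standard deviation.
    """
    mu = mean(heat.values())
    sigma = pstdev(heat.values())
    upper, lower = mu + k * sigma, mu - k * sigma
    tiers = {}
    for n, h in heat.items():
        if h > upper:
            tiers[n] = "hot"
        elif h < lower:
            tiers[n] = "cold"
        else:
            tiers[n] = "warm"
    return tiers


def prune_cold_edges(edges, tiers):
    """Drop edges joining two cold nodes; keep every other edge."""
    return [(u, v) for u, v in edges
            if not (tiers[u] == "cold" and tiers[v] == "cold")]
```

A caller would compute heat = node_heat(stats, now), pass the result to classify, and then apply prune_cold_edges to the edge list of the dependency graph. Edges from a cold node to a hot or warm node survive, which mirrors the retained first-order dependencies; a fuller treatment of indirect dependencies would need the graph traversal that the claims leave unspecified.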

Description

Storage strategy optimization method and device based on data heat and data lineage

Technical Field

The application relates to the field of data processing, and in particular to a storage strategy optimization method and device based on data heat and data lineage.

Background

Traditional data storage strategies have evolved from storing a single kind of structured data to supporting diverse data types. Early storage strategies were based primarily on the relational database model, whose core logic is table-based data organization, with read and write performance optimized through indexing mechanisms (e.g., B+ trees). This storage model excels at transaction processing, but with the arrival of the big data era, the demands of storing massive data and serving highly concurrent access have exposed its inherent limitations; in particular, the disk I/O bottleneck significantly degrades overall system performance. To meet this challenge, in-memory engines have become an important storage solution. For example, database systems such as HyPer, Peloton, and SAP HANA store hot data in small pages in memory and store compressed cold data in large pages, which reduces update cost and improves query efficiency. Meanwhile, academia and industry have proposed various storage strategies for hybrid transactional and analytical processing (HTAP) scenarios. These strategies fall into two categories: in-memory column selection based on primary row storage, and workload-driven hybrid row-column storage. The former dynamically adjusts the column-store data held in memory so that frequently accessed columns can be retrieved quickly; the latter automatically switches between row storage and column storage according to data heat to adapt to different query workloads.
However, existing research still has limitations. For example, manually specifying the in-memory column set adapts poorly to changes in query workload, while automatic selection methods may incur excessive computational cost because of their complex algorithm design. How to reduce the complexity of storage strategies while guaranteeing efficient queries therefore remains an important research topic. In addition, the advent of distributed storage systems has further enriched the choice of data storage strategies. For example, HBase-based distributed storage systems achieve efficient storage and load balancing of industrial time-series data through pre-partitioning and hot/cold data classification. Although these strategies alleviate the deficiencies of traditional storage approaches to some extent, they often depend on specific application scenarios and are difficult to generalize directly to other fields. The search for a storage strategy that is both general and flexible is therefore an important direction for future research.

Disclosure of Invention

In view of the problems in the prior art, the application provides a storage strategy optimization method and device based on data heat and data lineage, which address the shortcomings of conventional techniques in heat evaluation, lineage analysis, and storage optimization and provide technical support for storage management in software systems.
To solve at least one of the above problems, the application provides the following technical solutions: In a first aspect, the application provides a storage strategy optimization method based on data heat and data lineage, comprising: collecting data packet size, communication frequency, data type, and timestamp information of nodes in a software system, counting the access frequency, access time, and user behavior of the nodes, calculating the access frequency heat, time decay heat, and user behavior heat of the nodes, and combining the access frequency heat, the time decay heat, and the user behavior heat to generate a node heat value; constructing a directed graph with the nodes as vertices and the dependency relationships among the nodes as edges, taking the node heat value as a node attribute and the computing resource consumption of the nodes as the edge weight, and iteratively training on the directed graph using a graph attention network to generate a dynamic data lineage graph; and dividing the nodes into cold nodes, warm nodes, and hot nodes based on the heat values of the nodes in the dynamic data lineage graph, applying a priority caching strategy to the hot nodes, compressing and storing the cold nodes, retaining the direct dependencies of the cold nodes on the hot nodes and the warm nodes, deleting the indirect dependencies of the cold nodes, and dynamically updating the node characterization vectors of the graph attention network based on the heat values of the nodes. Furth