
CN-121523623-B - Distributed parallel data management method and system based on hierarchical storage


Abstract

The invention relates to the technical field of data storage, and in particular to a distributed parallel data management method and system based on hierarchical storage. A distributed data set is obtained, and each data object in it is numbered to obtain a first data object to an Nth data object. A data type sequence, a data size sequence and an access frequency sequence of the first to Nth data objects are obtained, and first to Nth data densities are derived from them through data density analysis. Reference-dependent joint analysis is performed on the first to Nth data objects to obtain a first weight matrix; density correction according to the first weight matrix yields first to Nth correction densities; and hierarchical storage according to the first to Nth correction densities yields a data storage hierarchy allocation result, thereby achieving more accurate and efficient storage management of distributed parallel data.

Inventors

  • WANG XINZHENG
  • JIA XIAOJIE
  • CAI YUAN
  • CHEN XINLEI
  • ZHAO YUBING

Assignees

  • 企商在线(北京)数据技术股份有限公司

Dates

Publication Date
20260512
Application Date
20260119

Claims (8)

  1. A distributed parallel data management method based on hierarchical storage, the method comprising: obtaining a distributed data set, and numbering each data object in the distributed data set to obtain a first data object to an Nth data object, wherein N represents the number of data objects; obtaining first data density to Nth data density through data density analysis according to the data type sequence, the data size sequence and the access frequency sequence; performing reference-dependent joint analysis on the first data object to the Nth data object to obtain a first weight matrix, and obtaining first correction density to Nth correction density through density correction according to the first data density to the Nth data density and the first weight matrix; and performing hierarchical storage according to the first to Nth correction densities and the number of storage levels to obtain a data storage level distribution result, and performing hierarchical storage configuration on the first to Nth data objects according to the data storage level distribution result; wherein obtaining the first data density to the Nth data density through data density analysis according to the data type sequence, the data size sequence and the access frequency sequence comprises: calculating information quantity representation data according to the data type sequence, the data size sequence and the access frequency sequence, and normalizing each information quantity representation data to obtain a normalized information quantity representation data sequence; the reference-dependent joint analysis comprises: obtaining data object pairs having a dependency relationship and a reference relationship by scanning the file header, metadata and content of each data object and identifying data objects that contain file path references and identifier references; calculating the dependency relationship strength and the reference relationship strength of each data object pair; superposing the dependency relationship strength and the reference relationship strength to construct an initial fusion matrix; setting iteration initial conditions; performing transfer association iterative optimization on the initial fusion matrix according to an iteration formula that contains a preset transfer attenuation coefficient; and terminating the iteration when the maximum strength value variation between two adjacent iterations is smaller than a preset convergence threshold, to obtain the first weight matrix; the density correction comprises: calculating a fusion enhancement factor of each data object according to the first to Nth data densities and the first weight matrix, wherein the value range of the enhancement factor is [0,1]; calculating a centrality index of each data object according to the first weight matrix; and obtaining the first to Nth correction densities through weighted fusion calculation according to the first to Nth data densities, the fusion enhancement factor of each data object and the centrality index of each data object; and the hierarchical storage comprises: taking the data objects as a vertex set and the weight values of the first weight matrix as edge weights forming an edge set, to construct a distributed data association graph; dividing the distributed data association graph into sub-graphs to obtain distributed data sub-graphs; and allocating the data objects belonging to the same distributed data sub-graph to the same storage level as the data storage level distribution result.
  2. The distributed parallel data management method based on hierarchical storage according to claim 1, wherein reading the data types, the data sizes and the access frequencies of the first data object to the Nth data object to obtain the data type sequence, the data size sequence and the access frequency sequence comprises: reading the data type identifiers of the first data object to the Nth data object at a preset sampling interval to obtain the data type sequence; reading the amount of storage space occupied by each of the first data object to the Nth data object to obtain the data size sequence; and reading the number of accesses to each of the first data object to the Nth data object within a preset statistical period to obtain the access frequency sequence.
  3. The distributed parallel data management method based on hierarchical storage according to claim 1, wherein superposing the dependency relationship strength and the reference relationship strength to construct the initial fusion matrix and performing transfer association analysis on the initial fusion matrix to obtain the first weight matrix comprises: constructing, from the dependency relationship strength and the reference relationship strength, an N × N matrix, where N is the total number of data objects, specifically: when a dependency relationship or a reference relationship exists between two data objects, the corresponding matrix element takes the strength value given by the corresponding dependency relationship strength and reference relationship strength; when no dependency relationship or reference relationship exists between two data objects, the corresponding matrix element is 0; calculating the connectivity of each data object as: [formula not preserved in the source text]; where the terms are the total number of data objects, an indicator function that takes the value 1 when its condition is true and 0 otherwise, a connection decision threshold with a preset value range [range not preserved], two elements of the first weight matrix at given rows and columns, a traversal index variable, and the connectivity of a data object; setting the iteration initial condition as: [condition not preserved]; and performing transfer association iterative optimization on the initial fusion matrix according to the iteration formula: [formula not preserved]; where the terms are the strength values between pairs of data objects after a given iteration, a preset transfer attenuation coefficient with a preset value range [range not preserved], and a traversal index over intermediate transfer nodes; and terminating the iteration when the maximum strength value variation between two adjacent iterations is smaller than a preset convergence threshold, to obtain the first weight matrix.
  4. The distributed parallel data management method based on hierarchical storage according to claim 1, wherein performing density correction on the first to Nth data densities according to the first weight matrix to obtain the first to Nth correction densities comprises: calculating the fusion enhancement factor of each data object according to the first to Nth data densities and the first weight matrix as: [formula not preserved in the source text]; where the terms are an element of the first weight matrix at a given row and column, the data density of a data object, an enhancement coefficient with a preset value range [range not preserved], and the fusion enhancement factor of a data object; calculating the centrality index of each data object according to the first weight matrix as: [formula not preserved]; where the terms are two elements of the first weight matrix at given rows and columns, a traversal index variable, the total number of data objects, and the centrality index of a data object; and calculating the first to Nth correction densities through weighted fusion according to the first to Nth data densities, the fusion enhancement factor of each data object and the centrality index of each data object as: [formula not preserved]; where the terms are the data density of a data object, three preset fusion weight coefficients whose value ranges are all [range not preserved], and the correction density of a data object.
  5. The distributed parallel data management method based on hierarchical storage according to claim 1, wherein performing hierarchical storage according to the first to Nth correction densities and the number of storage levels to obtain the data storage level allocation result comprises: taking the data objects as a vertex set and the weight values of the first weight matrix as edge weights forming an edge set, to construct a distributed data association graph; dividing the distributed data association graph into sub-graphs to obtain distributed data sub-graphs; and allocating the data objects belonging to the same distributed data sub-graph to the same storage hierarchy as the data storage hierarchy allocation result.
  6. The distributed parallel data management method based on hierarchical storage according to claim 4, wherein dividing the distributed data association graph into distributed data sub-graphs comprises: assigning weights according to the performance levels of the storage levels to obtain a weight value for each level, and setting the constraint that the sum of the sizes of the data objects allocated to each storage level does not exceed a preset capacity upper limit; and arranging the correction densities of the data objects in descending order and allocating each data object to the highest available level by a greedy algorithm, to form the distributed data sub-graphs.
  7. The distributed parallel data management method based on hierarchical storage according to claim 6, further comprising: before allocating the data objects belonging to the same distributed data sub-graph to the same storage hierarchy as the data storage hierarchy allocation result, performing optimal storage hierarchy allocation adjustment on those data objects, specifically: setting an optimization target of maximizing the accumulated correction density of all data objects in each storage hierarchy; performing level migration calculation on the association graph vertices to obtain the vertex adjustment gain as: [formula not preserved in the source text]; where the terms are the storage hierarchy of an association graph vertex before adjustment, its storage hierarchy after adjustment, the sum of the strength values between the vertex and the other data objects in each of those two storage hierarchies, and the adjustment gain of moving the vertex from the one storage hierarchy to the other; executing vertex migration optimization when the boundary vertex adjustment gain is greater than zero and the constraint is still satisfied after migration; and terminating the optimization when the change rate of the optimization target over two consecutive rounds is smaller than a preset threshold, to obtain the optimal storage hierarchy allocation.
  8. A distributed parallel data management system based on hierarchical storage, for executing the method of any one of claims 1-7, wherein the system comprises a data reading module, a density analysis module, a reference-dependent analysis module and a hierarchy allocation module, which are connected in sequence; the data reading module is used for obtaining a distributed data set, and numbering each data object in the distributed data set to obtain a first data object to an Nth data object, wherein N represents the number of data objects; the density analysis module is used for obtaining first data density to Nth data density through data density analysis according to the data type sequence, the data size sequence and the access frequency sequence; the reference-dependent analysis module is used for performing reference-dependent joint analysis on the first data object to the Nth data object to obtain a first weight matrix, and obtaining first correction density to Nth correction density through density correction according to the first data density to the Nth data density and the first weight matrix; the hierarchy allocation module is used for performing hierarchical storage according to the first to Nth correction densities and the number of storage hierarchies to obtain a data storage hierarchy allocation result, and performing hierarchical storage configuration on the first to Nth data objects according to the data storage hierarchy allocation result; wherein obtaining the first data density to the Nth data density through data density analysis according to the data type sequence, the data size sequence and the access frequency sequence comprises: calculating information quantity representation data according to the data type sequence, the data size sequence and the access frequency sequence, and normalizing each information quantity representation data to obtain a normalized information quantity representation data sequence; the reference-dependent joint analysis comprises: obtaining data object pairs having a dependency relationship and a reference relationship by scanning the file header, metadata and content of each data object and identifying data objects that contain file path references and identifier references; calculating the dependency relationship strength and the reference relationship strength of each data object pair; superposing the dependency relationship strength and the reference relationship strength to construct an initial fusion matrix; setting iteration initial conditions; performing transfer association iterative optimization on the initial fusion matrix according to an iteration formula that contains a preset transfer attenuation coefficient; and terminating the iteration when the maximum strength value variation between two adjacent iterations is smaller than a preset convergence threshold, to obtain the first weight matrix; the density correction comprises: calculating a fusion enhancement factor of each data object according to the first to Nth data densities and the first weight matrix, wherein the value range of the enhancement factor is [0,1]; calculating a centrality index of each data object according to the first weight matrix; and obtaining the first to Nth correction densities through weighted fusion calculation according to the first to Nth data densities, the fusion enhancement factor of each data object and the centrality index of each data object; and the hierarchical storage comprises: taking the data objects as a vertex set and the weight values of the first weight matrix as edge weights forming an edge set, to construct a distributed data association graph; dividing the distributed data association graph into sub-graphs to obtain distributed data sub-graphs; and allocating the data objects belonging to the same distributed data sub-graph to the same storage level as the data storage level distribution result.
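
The transfer association iteration recited in the claims can be illustrated with a minimal sketch. The iteration formula itself is not reproduced in this text, so the update rule below is an assumption chosen only for illustration: the strength between two objects is raised to the strongest two-hop path through an intermediate object, damped by the attenuation coefficient. The names `propagate`, `alpha`, `eps` and `max_iter` are hypothetical. Only the stopping rule (largest change between two adjacent iterations below a convergence threshold) follows the claim wording.

```python
def propagate(w0, alpha=0.5, eps=1e-6, max_iter=100):
    """Iterate an ASSUMED transfer rule on an initial fusion matrix.

    w0    : N x N list of lists with strengths in [0, 1]
    alpha : transfer attenuation coefficient (assumed to lie in (0, 1))
    eps   : convergence threshold on the largest per-element change
    """
    n = len(w0)
    w = [row[:] for row in w0]
    for _ in range(max_iter):
        nxt = [row[:] for row in w]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                # Strongest indirect path i -> k -> j through any intermediate k.
                via = max(
                    (w[i][k] * w[k][j] for k in range(n) if k != i and k != j),
                    default=0.0,
                )
                # Assumed rule: keep the direct strength unless the damped
                # indirect strength is larger.
                nxt[i][j] = max(w[i][j], alpha * via)
        # Largest change between two adjacent iterations.
        delta = max(abs(nxt[i][j] - w[i][j]) for i in range(n) for j in range(n))
        w = nxt
        if delta < eps:
            break
    return w


# Object 0 references object 1, and object 1 references object 2.
first_weight = propagate([[0.0, 0.8, 0.0],
                          [0.0, 0.0, 0.5],
                          [0.0, 0.0, 0.0]], alpha=0.5)
```

Under this assumed rule the entries never decrease and stay within [0, 1], so the loop settles after a few passes; in the usage above, object 0 acquires an indirect strength toward object 2 although the two are not directly linked.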

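The greedy step of claim 6 (sort by correction density in descending order, then place each object in the highest available level subject to a capacity upper limit) can be sketched as follows. This is a simplified illustration, not the claimed implementation: the function name `greedy_allocate`, the convention that index 0 is the highest-performance level, and the choice to return `None` when no level has room are all assumptions, and the per-level weight values mentioned in the claim are not modeled.

```python
def greedy_allocate(sizes, densities, capacities):
    """Assign each data object to the fastest storage level that still has room.

    sizes      : storage space occupied by each object
    densities  : correction density of each object
    capacities : capacity upper limit per level, index 0 = highest performance
    Returns a list of level indices (None where no level can hold the object).
    """
    remaining = list(capacities)
    levels = [None] * len(sizes)
    # Visit objects from the highest correction density to the lowest.
    order = sorted(range(len(sizes)), key=lambda i: densities[i], reverse=True)
    for i in order:
        for level, free in enumerate(remaining):
            if sizes[i] <= free:
                remaining[level] -= sizes[i]
                levels[i] = level
                break
    return levels


# Three objects, two levels: a fast level of capacity 6 and a slow level of capacity 10.
allocation = greedy_allocate(sizes=[4, 3, 5],
                             densities=[0.9, 0.2, 0.6],
                             capacities=[6, 10])
```

In this usage the densest object takes the fast level, and the other two no longer fit there, so they fall through to the slow level.
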
Description

Distributed parallel data management method and system based on hierarchical storage

Technical Field

The invention relates to the technical field of data storage, and in particular to a distributed parallel data management method and system based on hierarchical storage.

Background

With the expansion of enterprise business scale and the evolution of IT architectures toward distribution, data has become massive, multi-source and stored in a distributed manner. Traditional centralized data storage and management, in which storage strategies are formulated along a single dimension such as data type or access frequency, struggles to meet the requirements of parallel access, efficient storage and management of data in a distributed environment. For example, in a distributed system, a configuration file at an edge node may have a low access frequency yet contain key parameters for cluster scheduling, so its information value is extremely high; a log file at a central node may have an important data type and be accessed frequently, yet its content is mostly repetitive, so its actual information value is limited. Coarse management based on such surface features cannot accurately match the actual value of the data, which leads to misallocation of storage resources and degrades the parallel processing performance and data access efficiency of the whole system. Existing research has begun to pay attention to the information value of the data itself, but the analysis is often limited to independent data objects. In a distributed parallel computing environment, the value of a data object depends not only on its own attributes but also on its position and associations in the data network.
For example, an index file may not have a high information density of its own, but in a distributed system it is frequently referenced by multiple compute nodes and is the key entry point for accessing multiple important data slices, so its actual value is far higher than its own information density suggests. Because the prior art does not consider the complex reference and dependency relationships among distributed data when evaluating data value, data with high network value is wrongly allocated to a low-performance storage hierarchy, where it becomes a bottleneck for the parallel processing capacity of the system and degrades data access performance and overall management efficiency. Therefore, how to evaluate the information value of data more accurately in a distributed parallel environment, and to realize efficient and reliable hierarchical storage management based on that value, is a technical problem to be solved.

Disclosure of Invention

The invention provides a distributed parallel data management method and system based on hierarchical storage, to solve the problems that, in distributed data storage management, neglecting the correlation among data makes information value evaluation inaccurate, which in turn makes storage allocation unreasonable and impairs the parallel performance of the system. To achieve the above object, in one aspect, the invention provides a distributed parallel data management method based on hierarchical storage, the method including: obtaining a distributed data set, numbering each data object in the distributed data set to obtain a first data object to an Nth data object, wherein N represents the number of data objects, and reading the data types, the data sizes and the access frequencies of the first data object to the Nth data object to obtain a data type sequence, a data size sequence and an access frequency sequence.
Obtaining the first data density to the Nth data density through data density analysis according to the data type sequence, the data size sequence and the access frequency sequence. Performing reference-dependent joint analysis on the first data object to the Nth data object to obtain a first weight matrix, and obtaining first correction density to Nth correction density through density correction according to the first data density to the Nth data density and the first weight matrix. Performing hierarchical storage according to the first to Nth correction densities and the number of storage levels to obtain a data storage level distribution result, and performing hierarchical storage configuration on the first to Nth data objects according to the data storage level distribution result. Further, reading the data types, the data sizes and the access frequencies of the first data object to the Nth data object to obtain the data type sequence, the data size sequence and the access frequency sequence comprises the following steps: reading the data type identifiers of the first data object to the Nth data object according to a preset sampling interval to obtai