CN-120994631-B - JSON data incremental version management-oriented field granularity storage layout optimization method

Abstract

The invention provides a field-granularity storage layout optimization method for incremental version management of JSON data, and relates to the field of data version management. The method comprises: S1, constructing a storage graph model that supports field-granularity materialization; S2, converting a version graph into a storage graph with minimized total storage overhead according to the storage graph model; S3, if the minimized total storage overhead is smaller than the storage space constraint, performing optimization operations on all versions at both field granularity and version granularity to generate candidate change edge sets for optimizing the storage graph; S4, introducing user access preference as an adjustment factor and evaluating the value of every change edge set; and S5, adding the change edge sets to the storage graph in descending order of value, provided the storage space constraint is not violated, so as to update the storage graph. The invention keeps the total storage overhead below the storage space constraint while adapting to the user's preference between complete-version queries and history queries on specific fields.

Inventors

ZHANG XIAOTONG
CHEN FANGYI
HE JIE
GUO QIAN
CHEN WENCONG
DU WENDI

Assignees

北京科技大学

Dates

Publication Date: 20260505
Application Date: 20250625

Claims (10)

1. A field-granularity storage layout optimization method for JSON data incremental version management, characterized by comprising the following steps: S1, constructing a storage graph model supporting field-granularity materialization; S2, converting the version graph into a storage graph with minimized total storage overhead based on an MCA algorithm according to the constructed storage graph model; S3, if the minimized total storage overhead is smaller than the storage space constraint, performing optimization operations on all versions at both field granularity and version granularity to generate candidate change edge sets for optimizing the storage graph, wherein the generated candidate change edge sets comprise fully materialized change edge sets, field-granularity materialized change edge sets constructed based on the query load, direct delta change edge sets, and change edge sets removed by the MCA algorithm; S4, introducing user access preference as an adjustment factor, and evaluating the value of all the change edge sets; S5, adding the change edge sets to the storage graph with minimized total storage overhead in descending order of value, without violating the storage space constraint, so as to update the storage graph.
2. The field-granularity storage layout optimization method for JSON data incremental version management according to claim 1, wherein the storage graph model uses a storage graph representation based on an edge attribute graph, the representation supporting field-level deltas, and wherein the storage graph supporting field-granularity materialization is represented by a tuple whose components respectively denote the node set of the storage graph, the edge set of the storage graph, and the JSON data delta set of a version.
3. The field-granularity storage layout optimization method for JSON data incremental version management according to claim 2, wherein the storage graph representation based on the edge attribute graph comprises: defining, for each version in the node set of the storage graph, a JSON data delta set, wherein each version is a node of the storage graph and can be reconstructed from different predecessor versions and different JSON data deltas; the JSON data delta set of a version is the set of JSON data deltas from which that version can be reconstructed; each JSON data delta runs from one version to another version and carries a reference, the reference indicating the JSON data delta on which it depends rather than directly expressing the node connections in the storage graph; for a JSON data delta representing a version: if the delta is not materialized, the predecessor node of the version in the storage graph is the version node corresponding to the JSON data delta pointed to by the reference; if the delta is fully materialized or its reference is empty, the predecessor node of the version in the storage graph is the empty node; if the delta is field-granularity materialized, the predecessor nodes of the version in the storage graph include both the empty node and the version node corresponding to the reference, the former representing the source of the materialized part and the latter providing the version context for the remaining non-materialized part, wherein the empty node represents an empty document and is the predecessor node of all materialized versions; describing the structure of JSON data deltas in the storage graph by attributed edges, wherein each edge in the edge set of the storage graph is represented by a quadruple consisting of the start point of the edge, namely the predecessor version, the end point of the edge, namely the target version, the JSON data delta corresponding to the edge, and the version change operations contained in the edge, each edge indicating that the data state of the target version is obtained by applying, on the basis of the predecessor version, the version change operations in the JSON data delta; and defining change edge sets, wherein a change edge set consists of all edges associated with one JSON data delta and is treated as an atomic unit of operation during layout optimization, that is, either all edges in the set are retained or all are deleted; and wherein, in a storage graph whose optimization is complete, each version node can be represented by at most one JSON data delta.
4. The field-granularity storage layout optimization method for JSON data incremental version management according to claim 1, wherein the MCA algorithm is the Chu-Liu/Edmonds algorithm; and converting the version graph into the storage graph with minimized total storage overhead based on the MCA algorithm according to the constructed storage graph model comprises: constructing, from the constructed storage graph model and the version graph, a dictionary that stores every change edge set together with a mapping to its storage overhead, wherein a change edge set consists of all edges associated with one JSON data delta, the storage overhead is that of the JSON data delta, and the JSON data delta belongs to the JSON data delta set of a version; and, based on the dictionary, converting the version graph into a weighted directed acyclic graph (DAG) and obtaining a minimum cost arborescence, namely the storage graph with minimized total storage overhead, by the Chu-Liu/Edmonds algorithm, wherein, in the DAG, each change edge set is regarded as an edge pointing to its version and weighted by the storage overhead of its JSON data delta.
5. The field-granularity storage layout optimization method for JSON data incremental version management according to claim 1, wherein constructing a field-granularity materialized change edge set based on the query load comprises: based on the query load, performing field-granularity materialization on the JSON data delta corresponding to each queried version, with the query path as the target, wherein the query load comprises the query version, the query path and the query frequency; and, if a query path is covered by the path currently being processed or is a materialized path prefix, skipping generation of the change edge set for that query path.
6. The field-granularity storage layout optimization method for JSON data incremental version management according to claim 1, wherein generating a direct delta change edge set comprises: compressing the version sequence between the current version and its most recent materialized ancestor node into a new JSON data delta, wherein generating the new JSON data delta comprises: starting from the JSON data delta referenced by the JSON data delta of the current version, letting a pointer iteratively access the predecessor node and determining whether the predecessor version is fully materialized; if so, returning the sequence of all JSON data deltas from the materialized ancestor to the current version; otherwise, setting the pointer to the predecessor version and repeating the iteration; and merging the JSON data delta sequence finally obtained into a JSON data delta from which the current version can be reconstructed directly.
7. The field-granularity storage layout optimization method for JSON data incremental version management according to claim 1, wherein evaluating the value of all change edge sets comprises: determining an evaluation function for change edge sets, and using the evaluation function to calculate the value of adding each change edge set in the candidate change edge sets to the storage graph with minimized total storage overhead, wherein the evaluation function is expressed in terms of: the value; the increase in storage overhead after the corresponding change edge set is added; the reduction in complete reconstruction overhead after the corresponding change edge set is added; the reduction in specific-field history query overhead after the corresponding change edge set is added; a normalization function; the number of child nodes in the storage graph rooted at the version, which reflects the effect that modifying the current version has on all versions contained on the retrieval path; a complete reconstruction overhead weight and a specific-field history query overhead weight, which control the different contributions of the different overheads to the value of a change edge set under different user access preferences; and the storage overhead, the complete reconstruction overhead and the specific-field history query overhead of a version.
8. The field-granularity storage layout optimization method for JSON data incremental version management according to claim 7, wherein the storage overhead of a version is expressed in terms of the storage overhead of the JSON data deltas in the JSON data delta set of the version; the complete reconstruction overhead of a version is expressed using a minimization function over the complete reconstruction overhead of the JSON data deltas; and the specific-field history query overhead of a version is expressed in terms of the query overhead of the JSON data deltas and the query load.
9. The field-granularity storage layout optimization method for JSON data incremental version management according to claim 4, further comprising: obtaining the current version as the predecessor version and owning branch of a new version; registering the JSON data delta of the new version submitted by the user, updating the current version information to the newly submitted version, and updating the version graph to record the logical relationship between the new version and the existing versions; when the time for version layout optimization is reached, invoking steps S1-S5 on the version graph to generate a new storage graph; and, according to the generated new storage graph, materializing or de-materializing the corresponding JSON data deltas and storing them in JVD, wherein JVD denotes a JSON data-oriented version management system supporting multiple granularities.
10. A computer readable storage medium having stored therein program code which is callable by a processor to perform the method of any one of claims 1 to 9.
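
Steps S2 and S5 of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: `ChangeEdgeSet`, `ROOT`, `min_storage_graph` and `add_by_value` are hypothetical names, the `value` field stands in for the result of the S4 evaluation (whose formula is not reproduced here), and the sketch assumes that adding a candidate replaces the change edge set currently representing its target version, in line with the at-most-one-delta condition of claim 3. Because the example input is acyclic, choosing the cheapest incoming edge per version already yields a minimum cost arborescence; a general graph would need the full Chu-Liu/Edmonds algorithm.

```python
from dataclasses import dataclass

ROOT = "empty"  # hypothetical name for the empty-document node


@dataclass(frozen=True)
class ChangeEdgeSet:
    source: str         # predecessor version (ROOT means full materialization)
    target: str         # version reconstructed by this JSON data delta
    storage: int        # storage overhead of the delta
    value: float = 0.0  # assumed to be precomputed by step S4


def min_storage_graph(candidates):
    """S2 on an acyclic input: keep the cheapest incoming edge set per version."""
    best = {}
    for ces in candidates:
        if ces.target not in best or ces.storage < best[ces.target].storage:
            best[ces.target] = ces
    return best


def add_by_value(layout, extras, budget):
    """S5: try candidates in descending value order, respecting the budget.

    Assumption: a candidate replaces the current edge set of its target,
    so the storage increase is the difference between the two overheads.
    """
    layout = dict(layout)
    used = sum(c.storage for c in layout.values())
    for cand in sorted(extras, key=lambda c: c.value, reverse=True):
        increase = cand.storage - layout[cand.target].storage
        if used + increase <= budget:
            layout[cand.target] = cand
            used += increase
    return layout, used


candidates = [
    ChangeEdgeSet(ROOT, "v1", 100),
    ChangeEdgeSet("v1", "v2", 10),
    ChangeEdgeSet(ROOT, "v2", 105, value=0.4),
    ChangeEdgeSet("v2", "v3", 8),
    ChangeEdgeSet(ROOT, "v3", 110, value=0.9),
]
base = min_storage_graph(candidates)
# Edge sets not selected in S2 become candidates for S5.
extras = [c for c in candidates if base[c.target] is not c]
layout, used = add_by_value(base, extras, budget=230)
```

In this example the minimum-storage layout costs 118. With a budget of 230, materializing v3 raises the total to 220 and is accepted, whereas materializing v2 as well would raise it to 315 and is skipped.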

Description

JSON data incremental version management-oriented field granularity storage layout optimization method

Technical Field

The invention relates to the technical field of data version management, in particular to a field-granularity storage layout optimization method for JSON data incremental version management.

Background

Amid today's rapid advances in information technology, data has become an important asset for businesses and organizations. JSON (JavaScript Object Notation), a semi-structured data format, is widely used in fields such as machine learning, scientific computing and engineering applications because it is highly readable and flexibly supports nested structures. Traditional version control systems (VCS) are mainly oriented to version management of files; they lack optimization for JSON data and can hardly support fine-grained queries over the data and its historical versions. Other systems offer good data query capability but do not support automatic version control, so a dedicated version structure usually has to be built during data modeling and the evolution process maintained manually, which easily causes redundancy and errors. Still others are suited to linear temporal queries but are difficult to adapt to nonlinear version management requirements such as structural changes and branch evolution in JSON data. JSON data also tends to be highly dynamic in large-scale, complex systems. Taking a deep learning scenario as an example, training data, model parameters and interface inputs may frequently generate new versions in the preprocessing, model iteration and inference deployment stages. Against this background, storing multi-version JSON data incrementally both saves storage cost and preserves the data evolution process, and is therefore of great significance.
Relying on incremental storage alone, however, means that historical queries must backtrack through large amounts of change data, which hurts query performance. One feasible approach is storage layout optimization based on materialization: on top of incremental storage, snapshots are built for some fields of some key versions to reduce the computation needed to backtrack deltas. A reasonable materialization strategy can balance storage and query efficiency, but the critical question is which versions to materialize. If too few versions are materialized, queries still need to backtrack through a large amount of delta data, which affects query performance; if too many are materialized, excessive storage resources are occupied and the advantage of incremental storage is weakened. Materialization-based version storage strategies study how to materialize certain versions, that is, how to store some versions in full to construct a storage graph, so as to improve the performance of the version management system in storage, version switching, version submission, and cross-version query and operation. The objective of such methods is to build an optimized physical structure based on the logical relationships among versions. Version nodes with no incoming edges are fully stored versions, i.e., materialized nodes: the version is stored in its entirety and can be accessed directly without relying on data from other versions. Nodes with incoming edges are incrementally stored versions: the version stores the difference from its predecessor version, and the complete version data must be reconstructed by applying the differences.
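The trade-off between incremental and materialized storage can be made concrete with a small Python sketch. Everything here is an illustrative assumption rather than the format used by the invention: the `store` layout, the flat field-to-value delta, and the `DELETED` sentinel are hypothetical. The sketch reconstructs a version by walking back to the nearest fully stored version and replaying the deltas, and reports how many deltas had to be replayed.

```python
DELETED = object()  # hypothetical sentinel marking a removed field


def apply_delta(doc, delta):
    """Return a copy of doc with a flat {field: value} delta applied."""
    out = dict(doc)
    for field, value in delta.items():
        if value is DELETED:
            out.pop(field, None)
        else:
            out[field] = value
    return out


def reconstruct(store, version):
    """Walk back to the nearest materialized version, then replay deltas.

    store maps a version to ("full", doc) or ("delta", parent, delta).
    Returns the document and the number of deltas that were replayed.
    """
    chain = []
    entry = store[version]
    while entry[0] == "delta":
        _, parent, delta = entry
        chain.append(delta)
        entry = store[parent]
    doc = dict(entry[1])
    for delta in reversed(chain):  # oldest delta first
        doc = apply_delta(doc, delta)
    return doc, len(chain)


store = {
    "v1": ("full", {"lr": 0.1, "epochs": 5}),
    "v2": ("delta", "v1", {"lr": 0.05}),
    "v3": ("delta", "v2", {"epochs": DELETED, "batch": 32}),
}
doc, replayed = reconstruct(store, "v3")

# Materializing v3 trades extra storage for zero backtracking.
store["v3"] = ("full", doc)
doc_again, replayed_again = reconstruct(store, "v3")
```

Before materialization, reading v3 replays two deltas; afterwards it replays none, at the cost of storing the whole document a second time.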
Most materialization-based version storage strategies focus on archiving and reconstruction optimization for linear version chains, and are difficult to adapt to the nonlinear, multi-branch version graph structures common in JSON data. For example, the semantic merge strategy proposed by Buneman et al. is applicable to XML archiving, and Seering et al. optimize version storage in an array database based on a minimum spanning tree, but neither supports branch merging or field-granularity queries. Bhattacherjee et al. build on a framework that balances the global minimum spanning tree against the shortest path tree; although storage and reconstruction costs can both be considered to a certain extent, the optimization granularity remains at the document level and field-level access preference is not modeled. Zhou et al. and Guo et al. further improve the insertion and evaluation strategies for reconstruction edges on the basis of heuristic optimization methods, but still cannot effectively handle non-uniform query behavior at the field level. Given the above, properly selecting the versions to materialize is key to optimizing the storage layout. Selecting materialized versions typically involves the following three issues: First, since sto