CN-121996735-A - Massive open source code storage method and system based on atomization and bidirectional indexing


Abstract

The invention discloses a mass open source code storage method and system based on atomization and bidirectional indexing. Ingested software code data is split into unique, immutable code atoms, which are stored in a content-addressed source code compression storage structure, and an index is constructed over them. The source code compression storage structure comprises three areas: a metadata area, an atomic code pool, and a multidimensional index area. The metadata area stores the format version number, statistical information, the physical offset address and length of the atomic code pool, and the physical offset address and length of the multidimensional index area. The atomic code pool stores the code atoms. The multidimensional index area stores a version reconstruction forward index and a code segment reverse index. The statistical information comprises the number of software code projects, the number of software code versions, and the total number of code atoms. The invention offers advantages in storage efficiency, access performance, analysis capability, and system architecture.
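As an illustration of the three-area layout described above, the following minimal sketch packs a metadata area in front of an atomic code pool and an index area, then uses the recorded offsets and lengths to read the two areas back. The field widths, their order, the little-endian encoding, and the names `HEADER`, `Metadata`, `build_container`, and `read_areas` are assumptions made for this example; the source names the fields but does not specify a binary format.

```python
import struct
from dataclasses import dataclass

# Hypothetical fixed-size header: format version, three statistics,
# then (offset, length) for the atomic code pool and the index area.
HEADER = struct.Struct("<I3Q4Q")  # little-endian, no padding: 60 bytes


@dataclass
class Metadata:
    format_version: int
    project_count: int   # number of software code projects
    version_count: int   # number of software code versions
    atom_count: int      # total number of code atoms
    pool_offset: int
    pool_length: int
    index_offset: int
    index_length: int

    def pack(self) -> bytes:
        return HEADER.pack(
            self.format_version, self.project_count, self.version_count,
            self.atom_count, self.pool_offset, self.pool_length,
            self.index_offset, self.index_length,
        )

    @classmethod
    def unpack(cls, raw: bytes) -> "Metadata":
        return cls(*HEADER.unpack(raw[:HEADER.size]))


def build_container(pool: bytes, index: bytes,
                    projects: int, versions: int, atoms: int) -> bytes:
    """Lay out metadata area, atomic code pool, and index area back to back."""
    pool_offset = HEADER.size
    index_offset = pool_offset + len(pool)
    meta = Metadata(1, projects, versions, atoms,
                    pool_offset, len(pool), index_offset, len(index))
    return meta.pack() + pool + index


def read_areas(container: bytes):
    """Locate both data areas using only the offsets stored in the metadata."""
    meta = Metadata.unpack(container)
    pool = container[meta.pool_offset:meta.pool_offset + meta.pool_length]
    index = container[meta.index_offset:meta.index_offset + meta.index_length]
    return meta, pool, index


container = build_container(b"POOL-BYTES", b"INDEX-BYTES", 2, 5, 9)
meta, pool, index = read_areas(container)
```

Recording offsets and lengths in the metadata area lets a reader seek directly to either area without scanning the other, which is presumably why the structure stores them up front.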

Inventors

ZHANG SHUO
DING ZHIMING
YANG JIANWEN

Assignees

Institute of Software, Chinese Academy of Sciences (中国科学院软件研究所)

Dates

Publication Date: 20260508
Application Date: 20251211

Claims (10)

1. A mass open source code storage method based on atomization and bidirectional indexing, comprising the steps of splitting ingested software code data into unique, immutable code atoms, storing the code atoms in a content-addressed source code compression storage structure, and constructing an index, wherein the source code compression storage structure comprises three areas, namely a metadata area, an atomic code pool and a multidimensional index area; the metadata area is used for storing a format version number, statistical information, the physical offset address and length of the atomic code pool, and the physical offset address and length of the multidimensional index area; the atomic code pool is used for storing the code atoms; the multidimensional index area is used for storing a version reconstruction forward index and a code segment reverse index; and the statistical information comprises the number of software code projects, the number of software code versions, and the total number of code atoms.
2. The method of claim 1, wherein splitting software code data into unique, immutable code atoms is performed by: traversing the complete commit history of an open source software code project managed by a version control system; accessing each commit object and its associated tree object to obtain all blob objects under the snapshot of that commit version; calculating a hash value for each blob object as the globally unique identifier AtomID of the blob object; if the blob object is a text-type code file, parsing the blob object into an abstract syntax tree (AST) using a code parser and splitting the blob object into a plurality of semantically relatively complete code segments, using the definition boundaries of a "function", "method", or "class" as split points; if the code parser cannot parse the blob object, splitting the blob object into a plurality of code segments using line-based or content-defined splitting; and then calculating a hash value for each code segment as the unique identifier AtomID _ag_i of the segment.
3. The method of claim 2, wherein storing a code atom in the atomic code pool comprises: first querying whether the globally unique identifier AtomID of the current code atom already exists in the atomic code pool; if it does not, compressing the current code atom to obtain compressed data CompressedAtomData, then appending the key-value pair [AtomID, CompressedAtomData] to the end of the atomic code pool and updating the metadata; and if the globally unique identifier AtomID of the current code atom already exists in the atomic code pool, discarding the storage operation for the current code atom.
4. The method of claim 2, wherein constructing the version reconstruction forward index comprises: constructing a key-value storage structure, wherein a Key is generated from the project unique identifier ProjectID, the version identifier VersionID, and the file path FilePath to which a blob object belongs, and the code atom list AtomManifest of the blob object is used as the Value of the Key; the code atom list AtomManifest is an ordered list that records the globally unique identifiers AtomID of all code atoms required to reconstruct the blob object; and the Key and Value corresponding to each blob object are written into the key-value storage structure to obtain the version reconstruction forward index.
5. The method of claim 2, wherein constructing the code segment reverse index comprises: constructing an inverted index, using the unique identifier Fragment-AtomID of a code segment as the Key of the code segment, using the occurrence position list PostingList of the code segment as the Value of the code segment, and writing the Key of each code segment and its corresponding Value into the inverted index to obtain the code segment reverse index.
6. The method of claim 5, wherein each occurrence position of the code segment is recorded using a structure Location (ProjectID, VersionID, FilePath, FragmentIndex/Offset), wherein ProjectID is the unique identifier of the project to which the code segment belongs, VersionID is the identifier of the version to which the code segment belongs, FilePath is the file path to which the code segment belongs, and FragmentIndex/Offset indicates which segment of the blob object the code segment is, or its starting offset.
7. The method of claim 5, further comprising an artificial-intelligence-based semantic association index that complements or augments the code segment reverse index, constructed by first converting each code segment F_i into a high-dimensional feature vector vector_i and then building a vector index that stores a mapping from the high-dimensional feature vector vector_i to its corresponding structure Location or unique identifier Fragment-AtomID.
8. A method for querying mass open source code stored by the method of claim 1, comprising the steps of: 1) when a query request from a user to obtain a target file is received, querying the version reconstruction forward index to obtain the Key corresponding to the target file; 2) obtaining the corresponding Value according to the Key corresponding to the target file, namely the code atom list AtomManifest of the target file; 3) according to the code atom list AtomManifest of the target file, querying the atomic code pool in batches, obtaining the corresponding code atoms, concatenating them in order to form the target file, and returning the target file to the user.
9. A method for querying mass open source code stored by the method of claim 1, comprising the steps of: 1) when a query request for a vulnerability code segment Vulnerable_Snippet is received, calculating the unique identifier Fragment-AtomID_Vuln of the vulnerability code segment Vulnerable_Snippet; 2) querying the code segment reverse index to obtain the Value corresponding to the Key for the unique identifier Fragment-AtomID_Vuln, namely the occurrence position list PostingList of the vulnerability code segment Vulnerable_Snippet; 3) outputting an affected range report based on the occurrence position list PostingList of the vulnerability code segment Vulnerable_Snippet.
10. A mass open source code storage system based on atomization and bidirectional indexing, characterized by comprising a code atom processor, an index constructor, a source code compression storage structure, a code segment instant analyzer and a data reconstructor; the code atom processor is used for splitting software code data into unique, immutable code atoms; the index constructor is used for constructing a version reconstruction forward index and a code segment reverse index based on the code atoms; the source code compression storage structure comprises three areas, namely a metadata area, an atomic code pool and a multidimensional index area, wherein the metadata area is used for storing statistical information, the physical offset address and length of the atomic code pool, and the physical offset address and length of the multidimensional index area; the data reconstructor is used for, when a query request from a user to obtain a target file is received, querying the version reconstruction forward index to obtain the Key corresponding to the target file, obtaining the corresponding Value according to that Key, namely the code atom list AtomManifest of the target file, querying the atomic code pool in batches according to the code atom list AtomManifest of the target file, obtaining the corresponding code atoms, and concatenating them in order to form the target file; the code segment instant analyzer is used for, when a query request for a vulnerability code segment Vulnerable_Snippet is received, calculating the unique identifier Fragment-AtomID_Vuln of the vulnerability code segment Vulnerable_Snippet, querying the code segment reverse index to obtain the Value corresponding to the Key for the unique identifier Fragment-AtomID_Vuln, namely the occurrence position list PostingList of the vulnerability code segment Vulnerable_Snippet, and outputting an affected range report based on the occurrence position list PostingList of the vulnerability code segment Vulnerable_Snippet.
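The storage and query steps recited in claims 2 to 6, 8 and 9 can be sketched in memory as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: SHA-256 stands in for the unspecified hash, zlib for the unspecified compressor, only the fallback line-based split is shown (AST-based splitting is omitted), plain dictionaries replace the on-disk pool and indexes, and the names `atom_id`, `split_by_lines`, and `AtomStore` are hypothetical.

```python
import hashlib
import zlib
from collections import defaultdict


def atom_id(data: bytes) -> str:
    # Assumption: SHA-256; the claims only require "a hash value".
    return hashlib.sha256(data).hexdigest()


def split_by_lines(blob: bytes, lines_per_atom: int = 4) -> list[bytes]:
    # Fallback line-based split; concatenating the pieces restores the blob.
    lines = blob.splitlines(keepends=True)
    return [b"".join(lines[i:i + lines_per_atom])
            for i in range(0, len(lines), lines_per_atom)]


class AtomStore:
    def __init__(self) -> None:
        self.pool = {}                    # AtomID -> CompressedAtomData
        self.forward = {}                 # (ProjectID, VersionID, FilePath) -> AtomManifest
        self.reverse = defaultdict(list)  # Fragment AtomID -> PostingList

    def put_atom(self, data: bytes) -> str:
        aid = atom_id(data)
        if aid not in self.pool:          # already present: skip the store
            self.pool[aid] = zlib.compress(data)
        return aid

    def ingest(self, project: str, version: str, path: str, blob: bytes) -> None:
        manifest = []
        for index, fragment in enumerate(split_by_lines(blob)):
            aid = self.put_atom(fragment)
            manifest.append(aid)          # ordered list of atoms for this file
            self.reverse[aid].append((project, version, path, index))
        self.forward[(project, version, path)] = manifest

    def reconstruct(self, project: str, version: str, path: str) -> bytes:
        manifest = self.forward[(project, version, path)]
        return b"".join(zlib.decompress(self.pool[a]) for a in manifest)

    def locate(self, snippet: bytes) -> list:
        # Exact-match lookup: the snippet must coincide with a stored atom.
        return list(self.reverse.get(atom_id(snippet), []))


store = AtomStore()
shared = b"def f():\n    return 1\n"
store.ingest("A", "v1", "lib.py", shared)
store.ingest("B", "v2", "util/lib.py", shared)

assert len(store.pool) == 1                                   # stored once across projects
assert store.reconstruct("B", "v2", "util/lib.py") == shared  # forward index path
assert len(store.locate(shared)) == 2                         # reverse index path
```

In this sketch reconstruction is one manifest lookup followed by independent atom fetches, with no chain of deltas to replay. The reverse lookup only finds snippets whose bytes coincide exactly with a stored atom, so a snippet cut at different boundaries would be missed; claim 7's vector index is the mechanism the source offers for looser matching.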

Description

Massive open source code storage method and system based on atomization and bidirectional indexing

Technical Field

The invention belongs to the technical field of computer data storage and management, relates to mass data storage, data compression and data indexing technologies, and particularly relates to a mass open source code storage method and system based on atomization and bidirectional indexing.

Background

In recent years, open source software has become the foundation of modern information technology. To ensure the security, compliance and stability of the software supply chain, large-scale open source software supply chain infrastructure platforms are being built. Such platforms need to capture, aggregate, store and analyze massive amounts of open source software data from code hosting platforms worldwide (e.g., GitHub), and the source code of open source software, including its complete iteration history, is the portion that occupies the most storage space.

Existing mainstream open source code storage schemes rely primarily on a generic file system or object store (e.g., S3, HDFS) to store large numbers of compressed code archives (e.g., gzip), or directly store the original repository data of a code version control system (e.g., Git). These schemes focus mainly on backup and archiving of data, have weak data management and analysis capabilities, and are difficult to adapt to the requirements of modern open source software supply chain infrastructure platforms for deep analysis, security auditing and compliance inspection of massive code. Constructing a novel storage management method suited to massive, multi-version, highly redundant open source software code is a precondition for online analysis and supply chain security management.

In terms of mass code version management, the work most relevant to the present invention is the object model of Git itself and its Packfile storage mechanism. As a distributed version control system, Git is built on a content-addressable file storage and retrieval system that stores file content as blob objects, each uniquely identified by the SHA-1 hash of its content. To save space, Git packs multiple objects into one Packfile (pack file) and applies delta (i.e., difference) compression during packing, so that a version of a file may be stored as a delta between it and another "reference object".

First, the main disadvantage of existing mainstream generic file system or compressed archive storage schemes is that they are completely code-agnostic. They cannot understand the version iteration relationships or content repetition in the code, cannot perform any form of content retrieval or analysis, and are therefore not suitable for the application scenario of the present invention.

Second, the Git Packfile mechanism, which is most relevant to the present invention, achieves efficient version storage within a single project, but it has serious shortcomings in data management when applied to an open source software supply chain infrastructure platform that aggregates millions or more open source projects worldwide.

(1) Existing work (Git) only addresses intra-project redundancy and does not solve the global problem of cross-project redundancy. The compression optimization of Git operates within one project (i.e., intra-project). If two different projects A and B each contain the same file in their code bases, each project generates its own Packfile when packed, so the same file is physically stored two or more times. For a global platform that aggregates millions of projects or more, such redundant storage across projects (i.e., inter-project) results in significant wasted space.

(2) Existing work (Git) sacrifices access efficiency for compression and suffers from "delta chain" access latency. The delta compression mechanism of Git makes access inefficient. To obtain the content of a file at a particular historical version, if the file is stored as a delta, the system must first load the reference object and then apply a long delta chain in sequence to reconstruct the target file. When versions iterate quickly and the delta chain is long, this reconstruction process consumes substantial CPU and I/O resources, resulting in high access latency.

(3) Existing work (Git) lacks content analysis capability and cannot reverse index code fragments. The most critical drawback is that the Git model is a forward index (i.e., it finds "file content" from "project and version") and provides no reverse index from "file content" or a "code fragment" back to all of its occurrence locations. When the platform needs to perform security analysis (e.g., looking up in which projects and which versions a known vulnerability code segment appears) or complianc