CN-121980058-A - Unstructured data storage method, system and medium based on distributed database

Abstract

The invention discloses an unstructured data storage method, system and medium based on a distributed database, relating to the technical field of data storage. Building on conventional techniques, the method applies fine classification, differential extraction, collaborative cold and hot layering, and dynamic slicing and copy configuration. It thereby achieves efficient storage and fast querying of unstructured data, improves storage resource utilization and data access performance, and overcomes two limitations of conventional distributed storage: the separation of storage from querying, and static policies.

Inventors

HE PEIDONG
LIU SUJIE
LI RUICHAO
LI FANGSHUO
FAN LI
TU YAXIN
YANG XIAOXIAO
WANG JIAJU
DENG SHUYU

Assignees

State Grid Sichuan Electric Power Company Marketing Service Center (国网四川省电力公司营销服务中心)

Dates

Publication Date: 2026-05-05
Application Date: 2026-01-20

Claims (10)

1. An unstructured data storage method based on a distributed database, the method comprising: receiving a write request for target data, parsing a data identifier, and classifying the target data according to a preset classification rule to obtain data type and data size information; extracting data type features and metadata from the different types of target data based on a differential extraction strategy; based on the data type features and metadata, performing cold and hot layering on the target data in combination with Apache Doris, storing hot data on a solid-state drive (SSD) and cold data on a mechanical hard disk; calculating a hash value of the target data after the cold and hot layering to determine a storage node, and generating target data fragments in combination with the data size information; configuring a differentiated copy quantity and synchronization strategy according to the type and size of the target data fragments to obtain a plurality of copies of the target data fragments; and mapping the target data fragments and their corresponding copies to volume identifiers, and storing the volume identifiers in the corresponding storage nodes of the distributed database.
2. The unstructured data storage method based on a distributed database according to claim 1, wherein the preset classification rule comprises: dividing the target data into text class data, image class data, video class data and document class data according to the data type of the target data; and presetting a size threshold, comparing the size of the target data with the size threshold, and dividing the target data into small file data and large file data.
3. The unstructured data storage method based on a distributed database according to claim 2, wherein the differential extraction strategy comprises: for text class data, extracting text content and metadata package information, the metadata package information comprising the metadata package creation time, modification time and file size; for image class data, extracting element numbers and generating multidimensional feature vectors; for video class data, extracting a key frame within a period T, generating a multidimensional feature vector from the key frame, and extracting duration, bit rate and resolution; and for document class data, extracting text and metadata.
4. The unstructured data storage method based on a distributed database according to claim 3, wherein the method of dividing hot data and cold data in the cold and hot layering comprises: if the current target data belongs to video class data or image class data, dividing the current target data into hot data; otherwise, dividing target data whose access count within the latest time period T exceeds a preset access count threshold into hot data; if the current target data belongs to text class data or document class data, dividing the current target data into cold data; otherwise, dividing target data whose creation time exceeds a creation time threshold and whose access count is below the preset access count threshold into cold data.
5. The unstructured data storage method based on a distributed database according to claim 4, wherein the collaboration mechanism comprises a dynamic index adaptation mechanism, an intelligent query routing mechanism and a storage-query feedback optimization mechanism; the dynamic index adaptation mechanism comprises: constructing a three-level real-time index for hot data, wherein the first-level real-time index uses a feature value Bloom filter to filter out feature values for which no data exists, the second-level real-time index is a typed hash index, and the third-level real-time index is a metadata skip list index; and constructing an aggregate compression index for cold data by aggregating the data according to a time window and performing PCA (principal component analysis) dimension-reduction compression on the feature values of data of the same type, to generate a mapping table comprising a timestamp, a compressed feature vector and a storage node; the intelligent query routing mechanism comprises: a query feature extractor; automatic routing to the three-level real-time index of hot data, triggering a parallel I/O response from the solid-state drive (SSD); and automatic routing to the aggregate compression index of cold data, triggering batch reads from the mechanical hard disk and decompression calculation of the compressed features, when the query rate QPS is greater than or equal to a first query rate threshold QPSe and the data range falls within the latest time period T; the storage-query feedback optimization mechanism comprises: recording the response delay of hot data queries, and if the response delay of the current hot data exceeds a delay threshold n consecutive times, upgrading the current hot data to super-hot data and transferring it to a memory cache region; and recording the query frequency of cold data, and if the query frequency of the current cold data exceeds the average daily query frequency threshold for m consecutive days, returning the current cold data to hot data.
6. The unstructured data storage method based on a distributed database according to claim 5, wherein the typed hash index comprises a keyword double hash for hash indexing of text data and a key frame rolling hash for hash indexing of video data; hash modes are customized for different types of data, namely a keyword double hash for text data, a feature vector weighted hash for image data, and a key frame rolling hash for video data; and the metadata skip list index is constructed on the basis of the timestamp and the file size in the metadata, and supports rapid screening of data by time range and file size.
7. The unstructured data storage method based on a distributed database according to claim 5, wherein the method of constructing the aggregate compression index comprises: first, aggregating and grouping cold data by the combined dimension of data type and quarter; then, performing dimension-reduction compression on the feature vectors of image data and the feature vectors of video data; and finally, generating a mapping table comprising a timestamp range, a compressed feature mean and a storage node.
8. The unstructured data storage method based on a distributed database according to claim 2, wherein calculating the hash value of the target data after the cold and hot layering to determine the storage node, and generating the target data fragments in combination with the data size information, comprises: obtaining an initial hash value of the target data using a hash code generating function; obtaining a temporary hash value by taking the initial hash value modulo the number N of storage nodes; obtaining a final hash value by XORing the temporary hash value with a right-shifted copy of itself; determining the storage node according to the final hash value; and uniformly dividing the large file into a plurality of small files, and combining the small files into a small file group.
9. An unstructured data storage system based on a distributed database, for implementing the unstructured data storage method based on a distributed database of any of claims 1-8, the system comprising: an analysis and classification module, used for receiving the write request for target data, parsing the data identifier, and classifying the target data according to a preset classification rule to obtain data type and data size information; an extraction module, used for extracting data type features and metadata from the different types of target data based on a differential extraction strategy; a cold and hot layering processing module, used for performing cold and hot layering on the target data based on the data type features and metadata in combination with Apache Doris, storing hot data on a solid-state drive (SSD) and cold data on a mechanical hard disk; a computing module, used for calculating the hash value of the target data after the cold and hot layering to determine a storage node, and generating target data fragments in combination with the data size information; a fragmentation module, used for configuring a differentiated copy quantity and synchronization strategy according to the type and size of the target data fragments to obtain a plurality of copies of the target data fragments; and a storage module, used for mapping the target data fragments and their corresponding copies to volume identifiers, and storing the volume identifiers in the corresponding storage nodes of the distributed database.
10. A computer-readable medium having a computer program stored thereon, wherein the computer program is executable by a processor to implement the unstructured data storage method based on a distributed database of any of claims 1-8.
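
As an illustration only, the classification rule of claim 2 and the node selection and fragmentation steps of claim 8 might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the extension table, the size threshold, the chunk size, the 16-bit shift, the use of SHA-256 as the hash code generating function, and the final modulo that maps the hash value to a node index are all assumptions that the claims do not specify, and the function names are hypothetical.

```python
import hashlib
import os

# Hypothetical values; the claims leave all of these unspecified.
SIZE_THRESHOLD = 4 * 1024 * 1024   # assumed small/large file cutoff, in bytes
CHUNK_SIZE = 1024 * 1024           # assumed fragment size for large files
SHIFT_BITS = 16                    # assumed right-shift amount

# Assumed mapping from file extension to the four data classes of claim 2.
EXTENSION_TYPES = {
    ".txt": "text",
    ".jpg": "image",
    ".png": "image",
    ".mp4": "video",
    ".pdf": "document",
    ".docx": "document",
}


def classify(data_id: str, size: int) -> tuple[str, str]:
    """Return (data type, size class), assuming the identifier ends in a file extension."""
    ext = os.path.splitext(data_id)[1].lower()
    if ext not in EXTENSION_TYPES:
        raise ValueError(f"unrecognized data type for {data_id!r}")
    size_class = "large" if size > SIZE_THRESHOLD else "small"
    return EXTENSION_TYPES[ext], size_class


def select_node(data_id: str, node_count: int) -> int:
    """Pick a storage node index, following the step order recited in claim 8."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    digest = hashlib.sha256(data_id.encode("utf-8")).digest()
    initial = int.from_bytes(digest[:8], "big")   # initial hash value
    temp = initial % node_count                   # temporary hash value
    final = temp ^ (temp >> SHIFT_BITS)           # XOR with right-shifted copy
    # With fewer than 2**SHIFT_BITS nodes the shifted copy is 0, so final == temp.
    # The extra modulo (an assumption) keeps the index in range for larger clusters.
    return final % node_count


def make_fragments(payload: bytes,
                   size_threshold: int = SIZE_THRESHOLD,
                   chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Split a large payload into equal chunks (the last may be shorter)."""
    if len(payload) <= size_threshold:
        return [payload]
    return [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)]
```

For the same identifier and node count, `select_node` always returns the same index, so a later read can locate the node without a lookup table. Grouping small files into a small file group, also recited in claim 8, is not shown.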

Description

Unstructured data storage method, system and medium based on distributed database

Technical Field

The invention relates to the technical field of data storage, and in particular to an unstructured data storage method, system and medium based on a distributed database.

Background

With the spread of computer information technology and the growing number of internet users, data volume is increasing rapidly, and the proportion of unstructured data (such as text, images, video and documents) in the total continues to rise. Conventional distributed storage technology can provide basic storage for unstructured data, but it has the following shortcomings. First, data classification and feature extraction are not targeted: no differentiated strategy is adopted according to data type and size, so data processing is inefficient. Second, cold and hot layering only performs simple data separation: no collaboration mechanism is established with query requirements and the index structure is uniform, so it cannot adapt to query scenarios for data of different heat levels, and query response latency is high. Third, data slicing and copy configuration strategies are fixed and are not adjusted dynamically according to data characteristics, which leads to wasted storage resources or insufficient data reliability. Fourth, there is no dynamic feedback optimization mechanism between storage layering and query requirements, so it is difficult to adapt to changes in data access patterns.
In the prior art, for example CN116910310B, an unstructured data storage method based on a distributed database is disclosed that stores data through cold and hot layering, horizontal slicing and multiple copies. However, that technique does not involve fine classification by data type and size, its cold and hot layers are not configured with differentiated index structures, and its query routing lacks intelligent adaptation, so storage and query are split from each other. Its slicing and copy strategies also do not reflect differences in data characteristics. It therefore cannot meet the requirements of efficient storage and fast querying of massive unstructured data. There is thus a need for an unstructured data storage scheme that combines classification accuracy, storage suitability and query efficiency, and to this end an improved unstructured data storage method and device based on a distributed database is designed.

Disclosure of Invention

In view of the shortcomings of the prior art, the invention aims to provide an unstructured data storage method, system and medium based on a distributed database. Building on conventional techniques, it achieves efficient storage and fast querying of unstructured data through fine classification, differential extraction, collaborative cold and hot layering, and dynamic slicing and copy configuration, thereby improving storage resource utilization and data access performance and overcoming the separation of storage from querying and the static policies of conventional distributed storage.
The technical aim of the invention is achieved by the following technical solution.

The solution provides an unstructured data storage method based on a distributed database, the method comprising: receiving a write request for target data, parsing a data identifier, and classifying the target data according to a preset classification rule to obtain data type and data size information; extracting data type features and metadata from the different types of target data based on a differential extraction strategy; based on the data type features and metadata, performing cold and hot layering on the target data in combination with Apache Doris, storing hot data on a solid-state drive (SSD) and cold data on a mechanical hard disk; calculating a hash value of the target data after the cold and hot layering to determine a storage node, and generating target data fragments in combination with the data size information; configuring a differentiated copy quantity and synchronization strategy according to the type and size of the target data fragments to obtain a plurality of copies of the target data fragments; and mapping the target data fragments and their corresponding copies to volume identifiers, and storing the volume identifiers in the corresponding storage nodes of the distributed database.

Further preferably, the preset classification rule includes: dividing the target data into text class data, image class data, video class data and document class data according to the data type of the target data; and presetting a size threshold, comparing the size of the target data with the size threshold, and dividing the target data into small file data and large file data.

The further optimization scheme is that the differential extraction strategy co