
CN-122018813-A - HDFS intelligent hierarchical storage method and system based on multidimensional heat sensing and tape library integration


Abstract

The invention discloses an HDFS intelligent hierarchical storage method and system based on multi-dimensional heat sensing and tape library integration, belonging to the technical field of distributed file systems. The method comprises: extending HDFS metadata to construct a unified namespace; collecting data block access characteristics and dynamically dividing the data into hot, warm and cold levels using a K-means clustering algorithm; executing an adaptive tiering strategy, namely adding copies of hot data and storing them on SSD, keeping the default copies of warm data on disk, and migrating cold data to a tape library where it is stored with erasure coding; realizing relevance-aware distributed storage and fault tolerance through a message queue; and performing metadata persistence and periodic re-evaluation, so that data is automatically migrated and dynamically adjusted among the three tiers of SSD, disk and tape library. The method and system match the storage strategy precisely to the data access value, thereby reducing storage cost and improving the I/O performance and resource utilization of the system.

Inventors

  • ZHOU YI
  • Qian Mengdong
  • YU ZHE
  • CHANG SHENG
  • ZHOU SIHAN
  • ZHAO LU
  • XU XIAOZHOU
  • WANG DING

Assignees

  • 苏州工学院

Dates

Publication Date
20260512
Application Date
20260226

Claims (9)

  1. An HDFS intelligent hierarchical storage method based on multi-dimensional heat sensing and tape library integration, characterized by comprising the following steps:
     S1, constructing a unified namespace: by extending the metadata management mechanism of the Hadoop Distributed File System (HDFS), redesigning the in-memory file metadata structure and adding to the file metadata a storageTierMask attribute for recording the storage layer where a file resides and a tapeVolume attribute for recording the tape volume where the file resides, so that the tape library is logically integrated, as an ARCHIVE storage layer, with the standard disk (DISK) and the solid-state disk (SSD), thereby constructing a unified in-memory directory tree that provides a transparent file system view;
     S2, multi-dimensional heat sensing: under the unified namespace realized by the unified in-memory directory tree, collecting multi-dimensional access characteristics of the data blocks in the HDFS, including historical access frequency, response time, concurrency, time decay characteristics and data block size, and dynamically clustering the data blocks with a K-means clustering algorithm to divide the data into three levels: hot data, warm data and cold data;
     S3, executing an adaptive tiering strategy: executing the storage strategy corresponding to each of the three levels, namely keeping 3 copies of hot data and storing them on SSD media, keeping 2 copies of warm data and storing them on standard disk (DISK), and migrating cold data to the tape library ARCHIVE storage layer, where it is stored using a Reed-Solomon (RS) erasure code strategy; and, after the data has been successfully written to the tape library, updating the storageTierMask and tapeVolume attributes of the corresponding file metadata in the unified in-memory directory tree;
     S4, relevance-aware distributed storage: managing read and write requests to the tape library ARCHIVE storage layer through a message queue, and ensuring I/O load balance during data migration;
     S5, metadata persistence and life cycle management: persisting the unified in-memory directory tree to the on-disk image file FsImage, recording the operation log EditLog, guaranteeing metadata reliability and fast recovery after a system restart through a periodic checkpoint mechanism, and setting a periodic re-evaluation strategy to realize automatic migration and dynamic adjustment of data among the different storage layers.
  2. The HDFS intelligent hierarchical storage method based on multi-dimensional heat sensing and tape library integration according to claim 1, wherein step S1 specifically comprises:
     S11, redesigning the in-memory metadata structure of the HDFS INodeFile class and constructing the unified in-memory directory tree of the hierarchical storage system, so that tape-layer files are seamlessly integrated into the HDFS;
     S12, extending the HDFS metadata management mechanism by adding to the INodeFile class a storageTierMask attribute that records storage layer information and a tapeVolume attribute that records tape volume information, wherein the storageTierMask attribute uses binary bits to indicate the storage layers in which a copy of the file exists;
     S13, redesigning the file size recording mechanism in the INodeFile class so that a fileSize attribute is explicitly recorded for tape-layer files;
     S14, establishing an FSDirectory class to maintain the unified in-memory directory tree structure, the directory hierarchy being held entirely in memory and rooted at an INodeDirectory root directory;
     S15, implementing the redesigned metadata structure at the HDFS source code level, ensuring that the metadata of tape-layer files is seamlessly integrated into the unified in-memory directory tree.
  3. The HDFS intelligent hierarchical storage method based on multi-dimensional heat sensing and tape library integration according to claim 2, wherein in step S13 the storageTierMask attribute adopts a unified binary bit marking method in which the bits are defined in the order <tier_ssd_mask, tier_disk_mask, tier_tape_mask> and the bit corresponding to each layer that holds a copy is set to 1, with the following value rule:
     a value of 1 means the file exists only in the SSD layer;
     a value of 2 means the file exists only in the DISK layer;
     a value of 3 means the file has copies in the SSD layer and the DISK layer;
     a value of 4 means the file exists only in the TAPE layer;
     a value of 5 means the file has copies in the SSD layer and the TAPE layer;
     a value of 6 means the file has copies in the DISK layer and the TAPE layer;
     a value of 7 means the file has copies in all three layers, SSD, DISK and TAPE;
     and, when storageTierMask indicates that the file exists in the TAPE layer, the tapeVolume attribute records the unique identifier of the tape volume where the file resides, which is used to locate the file within the tape library.
  4. The HDFS intelligent hierarchical storage method based on multi-dimensional heat sensing and tape library integration according to claim 1, wherein step S2 specifically comprises:
     S21, collecting, through a log collection tool, the multi-dimensional access characteristics of the data blocks in the HDFS, including historical access frequency, response time, concurrency, time decay characteristics and data block size;
     S22, analyzing the multi-dimensional access characteristics with the K-means unsupervised machine learning clustering algorithm, using the K-means++ algorithm to optimize the selection of the initial cluster centers and the elbow rule to determine the optimal number of clusters;
     S23, dividing the data blocks into three levels, namely hot data, warm data and cold data, according to the clustering result.
  5. The HDFS intelligent hierarchical storage method based on multi-dimensional heat sensing and tape library integration according to claim 1, wherein step S3 specifically comprises:
     S31, storing the hot data on SSD media using a 3-copy strategy;
     S32, storing the warm data on standard disk (DISK) using a 2-copy strategy;
     S33, encoding the cold data with a Reed-Solomon (RS) erasure code algorithm and migrating it to the tape library storage layer.
  6. The HDFS intelligent hierarchical storage method based on multi-dimensional heat sensing and tape library integration according to claim 1, wherein step S4 comprises:
     S41, managing the request messages of the tape layer through a message server and establishing a read-write request queue mechanism;
     S42, when a tape-layer data access failure is detected, automatically switching to the disk-layer copy to provide service, triggering the data reconstruction process, and establishing a mechanism for monitoring tape media health and isolating faulty tapes.
  7. The HDFS intelligent hierarchical storage method based on multi-dimensional heat sensing and tape library integration according to claim 1, wherein step S5 specifically comprises:
     S51, using the Google Protocol Buffers data serialization protocol to persist the unified in-memory directory tree, including the metadata of the tape library files, into the FsImage file;
     S52, merging the EditLog and FsImage files through a periodic checkpoint mechanism;
     S53, setting a periodic re-evaluation strategy under which step S2 and step S3 are re-executed at a preset time interval.
  8. An HDFS intelligent hierarchical storage system based on multi-dimensional heat sensing and tape library integration, configured to perform the HDFS intelligent hierarchical storage method based on multi-dimensional heat sensing and tape library integration according to any one of claims 1 to 7, comprising:
     a unified namespace construction module, used for redesigning the in-memory file metadata structure by extending the metadata management mechanism of the HDFS, and constructing a unified in-memory directory tree that provides a transparent file system view;
     a multi-dimensional heat sensing module, used for collecting, under the unified namespace realized by the unified in-memory directory tree, the historical access frequency, response time, concurrency, time decay characteristics and data block size of the data blocks in the HDFS, and dynamically clustering the data blocks with a K-means clustering algorithm to divide the data into three levels: hot data, warm data and cold data;
     an adaptive tiering strategy execution module, used for executing the storage strategy corresponding to each of the three levels, namely keeping 3 copies of hot data and storing them on SSD media, keeping 2 copies of warm data and storing them on standard disk (DISK), and migrating cold data to the tape library ARCHIVE storage layer, where it is stored using a Reed-Solomon (RS) erasure code strategy, and for updating the storageTierMask and tapeVolume attributes of the corresponding file metadata in the unified in-memory directory tree after the data has been successfully written to the tape library;
     a relevance-aware distributed storage module, used for managing read and write requests to the tape library ARCHIVE storage layer through a message queue and ensuring I/O load balance during data migration;
     a metadata persistence and life cycle management module, used for persisting the unified in-memory directory tree to the on-disk image file FsImage, recording the operation log EditLog, guaranteeing metadata reliability and fast recovery after a system restart through a periodic checkpoint mechanism, and setting a periodic re-evaluation strategy to realize automatic migration and dynamic adjustment of data among the different storage layers.
  9. The HDFS intelligent hierarchical storage system based on multi-dimensional heat sensing and tape library integration according to claim 8, the system further comprising:
     an exception handling and fault tolerance module, used for automatically switching to the disk-layer copy to provide service when a tape-layer data access fails, triggering the data reconstruction process, and establishing a mechanism for monitoring tape media health and isolating faulty tapes.
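
Illustrative note on claim 3: the storageTierMask value rule is a three-bit flag set, with SSD as the lowest bit. The Python sketch below decodes a mask into tier names. It is a minimal illustration, not the patented implementation (which extends the Java INodeFile class); the names StorageTier, tiers_of and needs_tape_volume are hypothetical.

```python
from enum import IntFlag


class StorageTier(IntFlag):
    """Bit flags for storageTierMask, in the order given in claim 3."""
    SSD = 1   # tier_ssd_mask
    DISK = 2  # tier_disk_mask
    TAPE = 4  # tier_tape_mask


def tiers_of(mask: int) -> list[str]:
    """Return the names of the tiers that hold a copy of the file."""
    if not 1 <= mask <= 7:
        raise ValueError(f"storageTierMask must be in 1..7, got {mask}")
    return [tier.name for tier in StorageTier if mask & tier.value]


def needs_tape_volume(mask: int) -> bool:
    """tapeVolume is only meaningful when the TAPE bit is set."""
    return bool(mask & StorageTier.TAPE.value)
```

For example, tiers_of(5) lists SSD and TAPE, matching the rule that a value of 5 means copies exist in the SSD layer and the TAPE layer.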

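Illustrative note on claim 4: the clustering in steps S22 and S23 can be sketched in pure Python as K-means with K-means++ seeding. This is a simplified sketch under several assumptions that are not in the claims: the features are already normalized to comparable ranges, the number of clusters is fixed at 3 (the elbow rule is omitted), several restarts are run and the lowest-inertia result is kept, and clusters are named by ranking their centers on the first feature (access frequency). All function names are hypothetical.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_init(points, k, rng):
    """K-means++ seeding: each new centre is drawn with probability
    proportional to its squared distance from the nearest chosen centre.
    Assumes the data contains at least k distinct points."""
    centres = [rng.choice(points)]
    while len(centres) < k:
        weights = [min(dist2(p, c) for c in centres) for p in points]
        centres.append(rng.choices(points, weights=weights, k=1)[0])
    return centres


def kmeans(points, k, rng, max_iter=100):
    """Lloyd's algorithm; returns (labels, centres, inertia)."""
    centres = kmeanspp_init(points, k, rng)
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda j: dist2(p, centres[j]))
                  for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # An empty cluster keeps its previous centre.
            updated.append(tuple(sum(c) / len(members) for c in zip(*members))
                           if members else centres[j])
        if updated == centres:
            break
        centres = updated
    inertia = sum(dist2(p, centres[lab]) for p, lab in zip(points, labels))
    return labels, centres, inertia


def classify_heat(points, n_init=5, seed=0):
    """Cluster into 3 groups and name them by mean access frequency
    (feature 0): highest is hot, lowest is cold. This ranking rule is an
    assumption of the sketch, not taken from the claims."""
    rng = random.Random(seed)
    labels, centres, _ = min((kmeans(points, 3, rng) for _ in range(n_init)),
                             key=lambda run: run[2])
    ranked = sorted(range(3), key=lambda j: centres[j][0], reverse=True)
    names = dict(zip(ranked, ("hot", "warm", "cold")))
    return [names[lab] for lab in labels]


# Hypothetical normalized features: (access frequency, recency score).
blocks = [
    (0.90, 0.90), (0.91, 0.90), (0.90, 0.91),  # read often and recently
    (0.50, 0.50), (0.51, 0.50), (0.50, 0.51),
    (0.05, 0.05), (0.06, 0.05), (0.05, 0.06),  # rarely read
]
levels = classify_heat(blocks)
```

The restarts guard against the local minima that a single K-means run can fall into; a production system would more likely use a library implementation and the full five-feature vector named in step S21.
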
Description

HDFS intelligent hierarchical storage method and system based on multidimensional heat sensing and tape library integration

Technical Field

The invention relates to the technical field of distributed file systems, in particular to an HDFS intelligent hierarchical storage method and system based on multidimensional heat sensing and tape library integration.

Background

With the continuous development of internet technology, the era of big data has arrived, and big-data technologies are being applied ever more widely. The Apache Hadoop framework and its core component HDFS have become key infrastructure for processing such data, owing to their high throughput, high fault tolerance and high scalability. However, conventional HDFS has the following significant drawbacks when handling massive volumes of cold data that must be retained for a long time:

First, storage cost is high. HDFS by default uses a static three-copy policy to guarantee data reliability, so the total storage consumed is 300% of the logical data size, that is, an additional overhead of up to 200%. For cold data with a very low access frequency (such as experimental raw data that has already been analyzed), this strategy wastes a large amount of storage resources.

Second, storage resource utilization is low. Existing HDFS heterogeneous storage work (such as HDFS-2832) supports SSD/HDD tiering but lacks a fine-grained management mechanism based on data access heat. As a result, high-performance storage media (e.g., SSDs) may be occupied by large amounts of cold data while frequently accessed hot data cannot obtain sufficient performance resources, leading to a resource mismatch.

Finally, tape library support is lacking. A tape library is an ideal choice for storing massive cold data because of its large capacity, low cost, low energy consumption and long data retention period. However, the existing HDFS architecture cannot natively integrate a tape library, cannot easily archive cold data into a tape library automatically and transparently, and therefore cannot meet the need for cost-effective archiving of massive data.

In view of the foregoing, there is a strong need in the art for an HDFS optimization scheme that can dynamically sense data heat, tier storage intelligently and seamlessly integrate low-cost tape library resources, so as to significantly reduce overall storage cost while guaranteeing data reliability and access performance.

Disclosure of Invention

The invention aims to provide an HDFS intelligent hierarchical storage method and system based on multi-dimensional heat sensing and tape library integration. It addresses the high storage cost and low resource utilization that arise because conventional HDFS heterogeneous storage does not support tape libraries and uses a single data placement strategy. By integrating a tape library and combining it with multi-dimensional heat sensing, the invention realizes intelligent hierarchical storage, matches the storage strategy precisely to the data access value, and thereby reduces storage cost and improves the I/O performance and resource utilization of the system.

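
To make the cost argument in the Background concrete, the sketch below computes the raw capacity consumed per logical byte under replication and under Reed-Solomon coding. The source does not state the RS parameters; RS(6,3) is assumed here purely for illustration, and the function names are hypothetical.

```python
def replication_multiplier(replicas: int) -> float:
    """Raw bytes stored per logical byte under n-way replication."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return float(replicas)


def rs_multiplier(data_units: int, parity_units: int) -> float:
    """Raw bytes stored per logical byte under RS(data, parity) coding."""
    if data_units < 1 or parity_units < 0:
        raise ValueError("invalid Reed-Solomon parameters")
    return (data_units + parity_units) / data_units


def extra_overhead_percent(multiplier: float) -> float:
    """Storage consumed beyond the logical data size, as a percentage."""
    return (multiplier - 1.0) * 100.0


policies = {
    "hot: 3 replicas on SSD": replication_multiplier(3),
    "warm: 2 replicas on DISK": replication_multiplier(2),
    # Assumed parameters; the source only says "Reed-Solomon RS".
    "cold: RS(6,3) on tape": rs_multiplier(6, 3),
}
for name, m in policies.items():
    print(f"{name}: {m:.1f}x raw capacity, "
          f"{extra_overhead_percent(m):.0f}% extra")
```

Under these assumptions, three-way replication costs 3.0x raw capacity (200% extra), while RS(6,3) costs 1.5x (50% extra), which is the kind of saving the tiering strategy targets for cold data.
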
In order to achieve the above purpose, the invention provides an HDFS intelligent hierarchical storage method based on multi-dimensional heat sensing and tape library integration, which comprises the following steps:
S1, constructing a unified namespace: by extending the metadata management mechanism of the Hadoop Distributed File System (HDFS), redesigning the in-memory file metadata structure and adding to the file metadata a storageTierMask attribute for recording the storage layer where a file resides and a tapeVolume attribute for recording the tape volume where the file resides, so that the tape library is logically integrated, as an ARCHIVE storage layer, with the standard disk (DISK) and the solid-state disk (SSD), thereby constructing a unified in-memory directory tree that provides a transparent file system view;
S2, multi-dimensional heat sensing: under the unified namespace realized by the unified in-memory directory tree, collecting multi-dimensional access characteristics of the data blocks in the HDFS, including historical access frequency, response time, concurrency, time decay characteristics and data block size, and dynamically clustering the data blocks with a K-means clustering algorithm to divide the data into three levels: hot data, warm data and cold data;
S3, executing an adaptive tiering strategy: executing the storage strategy corresponding to each of the three levels, namely keeping 3 copies of hot data and storing them on SSD media, keeping 2 copies of warm data and storing them on standard disk (DISK), and migrating cold data to the tape library ARCHIVE storage layer, where it is stored using a Reed-Solomon (RS) erasure code stra