CN-121979442-A - Fixed-length data deduplication method, device, equipment and program product

CN121979442ACN 121979442 ACN121979442 ACN 121979442ACN-121979442-A

Abstract

The application relates to a fixed-length data deduplication method, a fixed-length data deduplication device, computer equipment and a computer program product. The method comprises the steps of dividing each fixed-length data into a plurality of sub-data blocks before writing each fixed-length data, selecting a plurality of sampling data blocks from the plurality of sub-data blocks according to a preset sampling mode, determining sub-fingerprints corresponding to each sampling data block based on a preset hash algorithm, enabling the calculation cost of the preset hash algorithm to be lower than Jiang Haxi algorithm, combining all the sub-fingerprints according to a preset combination mode to obtain similar fingerprints of the fixed-length data, writing each fixed-length data and the similar fingerprints thereof into a storage medium, and performing data re-deleting based on the similarity of the similar fingerprints of each prune-length data to be re-deleted in a background re-deleting stage. The method reduces CPU computing overhead on a data writing path by improving a background re-deleting scheme, avoids I/O performance waste caused by reading non-repeated data, and realizes the minimization of the influence on the read-write performance of a storage system while ensuring the re-deleting effect.

Inventors

YU FEIFAN
LIU XIAOBO
SHAO CHANGGENG
GUO ZHAOBIN
MIAO YANCHAO

Assignees

曙光信息产业股份有限公司
曙光信息产业(北京)有限公司

Dates

Publication Date: 20260505
Application Date: 20251205

Claims (10)

1. The fixed-length data deleting method is characterized by comprising the following steps of: Before writing each fixed-length data, dividing each fixed-length data into a plurality of sub-data blocks according to a preset dividing mode; Selecting a plurality of sampling data blocks from the plurality of sub data blocks according to a preset sampling mode; Determining sub fingerprints corresponding to the sampling data blocks based on a preset hash algorithm, wherein the calculation cost of the preset hash algorithm is lower than Jiang Haxi algorithm; combining all the sub-fingerprints according to a preset combination mode to obtain similar fingerprints of the fixed-length data; And writing the fixed-length data and the similar fingerprints thereof into a storage medium, reading the similar fingerprints of the to-be-repeated prune long data from the storage medium in a background deleting stage, and deleting the data based on the similarity of the read similar fingerprints.
2. The method of claim 1, wherein the step of data deduplication based on the similarity of the read similar fingerprints comprises: dynamically calculating a current similarity threshold based on the read similar fingerprints; based on the current similarity threshold, grouping the read similar fingerprints to gather similar fingerprints with similarity higher than the current similarity threshold into the same group; And deleting the data to be deleted prune long which are consistent with the similar fingerprints in the same group.
3. The method of claim 2, wherein the step of dynamically calculating a current similarity threshold based on the read similar fingerprints comprises: Calculating the overall similarity distribution between the read similar fingerprints; and calculating a preset statistic of the overall similarity distribution, and taking the preset statistic as the current similarity threshold.
4. A method according to claim 3, wherein the step of grouping the read similar fingerprints comprises: And based on the current similarity threshold, assigning similar fingerprints with similarity higher than the current similarity threshold to the same group through a local sensitive hash algorithm or a clustering algorithm.
5. The method of any one of claims 1 to 4, wherein prior to each of the fixed-length data writes, the method further comprises: Identifying whether each fixed-length datum is a predefined special datum; If the fixed-length data is the special data, a predefined fingerprint is directly allocated to the fixed-length data and used as a similar fingerprint to be written into the storage medium together with the fixed-length data.
6. The method of claim 5, wherein the special data comprises data having a content of all 0 s or all 1 s.
7. The method according to any one of claims 1 to 4, wherein the preset cut mode is a uniform cut mode; And/or the preset sampling mode is an equidistant sampling mode.
8. A fixed-length data deduplication apparatus, comprising: the device comprises a similar fingerprint generation module, a sampling data block selection module, a hash algorithm determination module, a calculation cost reduction module and a combination module, wherein the similar fingerprint generation module is used for dividing each fixed-length data into a plurality of sub-data blocks according to a preset dividing mode before writing the fixed-length data, selecting a plurality of sampling data blocks from the plurality of sub-data blocks according to a preset sampling mode, determining sub-fingerprints corresponding to the sampling data blocks based on the preset hash algorithm, wherein the calculation cost of the preset hash algorithm is lower than Jiang Haxi algorithm; the data writing module is used for writing the fixed-length data and the similar fingerprints thereof into the storage medium; And the deduplication execution module is used for reading similar fingerprints of each piece of data to be deduplicated prune long from the storage medium in a background deduplication stage, and performing data deduplication based on the similarity of the read similar fingerprints.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 7 when the computer program is executed.
10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.

Description

Fixed-length data deduplication method, device, equipment and program product Technical Field The present application relates to the field of data storage technologies, and in particular, to a fixed-length data deduplication method, a fixed-length data deduplication apparatus, a computer device, and a computer program product. Background In data storage, data deduplication (Data Deduplication) is a common data reduction technique that helps the storage system reduce the impact of redundant data on storage costs. According to the time when the deduplication operation occurs, online deduplication and background deduplication can be classified, wherein background deduplication refers to that when two identical data are written into a storage system, the data cannot be immediately deduplicated, and the system periodically executes repeated data identification and deletion operation in the background after the data is dropped. Referring to FIG. 1, two Data 1 are the same Data to be deleted, LBA1 and LBA2 represent two different logical addresses, FP represents a fingerprint, data g represents garbage Data, hash represents a Hash algorithm for computing the fingerprint, and referring to FIG. 1, it can be seen that a scheme of conventional background deletion is that two Data are first read from a storage medium, then the fingerprint of each Data is computed, if the fingerprints are the same, the repeated Data are deleted, and at this time, the corresponding Data in the disk are marked as garbage Data. The scheme has the inherent defects that the background deleting process needs to read all original data to be checked from the disk to calculate fingerprints although the saving of the storage space is realized, and if the read data is not repeated data, the disk I/O operation and the attached calculation overhead thereof are pure performance loss, and the disk bandwidth and CPU resources applied to processing the front-end business I/O are seriously occupied, so that the overall performance of the storage system is obviously and negatively influenced. Another background deduplication scheme for solving the above problem is to pre-locate the fingerprint calculation link before the data is dropped, as shown in fig. 2. In the scheme, the fingerprint is calculated before the data is written, then the data and the fingerprint are written into the disk together, and when the background is deleted, the system directly compares the fingerprints without reading the disk data again. However, the algorithm usually adopted in the fingerprint pre-calculation is relatively high in calculation intensity and high in CPU resource consumption, which directly causes a significant increase in delay of a data writing path, and seriously affects the performance of front-end writing operation. Therefore, a new data deduplication method is urgently needed in the field, and can avoid a large amount of background data reading and expensive front-end fingerprint calculation at the same time, so that the deduplication efficiency is ensured, and meanwhile, the interference on the read-write performance of a storage system is reduced to the greatest extent. Disclosure of Invention In view of the foregoing, it is desirable to provide a fixed-length data deduplication method, apparatus, computer device, and computer program product. In a first aspect, the present application provides a fixed-length data deduplication method, where the method includes: Before writing each fixed-length data, dividing each fixed-length data into a plurality of sub-data blocks according to a preset dividing mode; Selecting a plurality of sampling data blocks from the plurality of sub data blocks according to a preset sampling mode; Determining sub fingerprints corresponding to the sampling data blocks based on a preset hash algorithm, wherein the calculation cost of the preset hash algorithm is lower than Jiang Haxi algorithm; combining all the sub-fingerprints according to a preset combination mode to obtain similar fingerprints of the fixed-length data; And writing the fixed-length data and the similar fingerprints thereof into a storage medium, reading the similar fingerprints of the to-be-repeated prune long data from the storage medium in a background deleting stage, and deleting the data based on the similarity of the read similar fingerprints. In one embodiment, the step of deleting the data based on the similarity of the read similar fingerprints includes: dynamically calculating a current similarity threshold based on the read similar fingerprints; based on the current similarity threshold, grouping the read similar fingerprints to gather similar fingerprints with similarity higher than the current similarity threshold into the same group; And deleting the data to be deleted prune long which are consistent with the similar fingerprints in the same group. According to the scheme, unnecessary disk reading and fingerprint comparison operations are greatly redu