CN-119356943-B - Method and device for determining backup nodes in global deduplication storage scene


Abstract

The application relates to a method and a device for determining backup nodes in a global deduplication storage scene. The method comprises: determining a plurality of backup tasks, wherein the backup tasks comprise data to be backed up and each backup task corresponds to a fingerprint feature vector determined according to the data to be backed up; merging the data to be backed up of those backup tasks that meet a merging condition, according to the feature distances between the fingerprint feature vectors corresponding to the backup tasks, to generate a plurality of new backup tasks; determining the similarity between the plurality of new backup tasks and a plurality of backup nodes according to the fingerprint feature vectors corresponding to the new backup tasks and the fingerprint feature vectors of cache data in the backup nodes; and sequentially determining, from the plurality of backup nodes, target backup nodes for executing the backup tasks according to the similarity and performance information of the backup nodes. The method can improve the efficiency of determining backup nodes in the global deduplication storage scene.

Inventors

YU JIAN
YANG WEIMIN
CHEN MENGYU
MA LIKE
WANG ZIJUN

Assignees

安徽鼎甲计算机科技有限公司

Dates

Publication Date: 20260505
Application Date: 20241010

Claims (10)

1. A method for determining backup nodes in a global deduplication storage scene, characterized by comprising the following steps:
determining a plurality of backup tasks, wherein the backup tasks comprise data to be backed up, and each backup task corresponds to a fingerprint feature vector determined according to the data to be backed up;
merging the data to be backed up of those backup tasks that meet a merging condition, according to the feature distances between the fingerprint feature vectors corresponding to the backup tasks, to generate a plurality of new backup tasks;
determining the similarity between the plurality of new backup tasks and a plurality of backup nodes according to the fingerprint feature vectors corresponding to the new backup tasks and the fingerprint feature vectors of cache data in the backup nodes; and
sequentially determining, from the plurality of backup nodes, target backup nodes for executing the new backup tasks according to the similarity and performance information of the backup nodes, wherein the target backup nodes are used for storing the data to be backed up of the new backup tasks and for deleting data that is duplicated between the data to be backed up and the cache data.
2. The method of claim 1, wherein merging the data to be backed up of the backup tasks that meet the merging condition, according to the feature distances between the fingerprint feature vectors corresponding to the backup tasks, to generate a plurality of new backup tasks, comprises:
determining, according to the feature distances between the fingerprint feature vectors corresponding to the backup tasks, the two backup tasks with the minimum feature distance; and
if the total data amount of the two backup tasks after merging is smaller than a preset data amount, merging the two backup tasks to obtain a merged backup task, and returning to the step of determining the two backup tasks with the minimum feature distance, until the number of new backup tasks is smaller than a preset classification number.
3. The method of claim 2, wherein the preset data amount is determined according to the total data amount of the plurality of backup tasks and the preset classification number, and the preset classification number is determined according to the performance information of the plurality of backup nodes and the number of nodes.
4. The method of claim 1, wherein sequentially determining, from the plurality of backup nodes, the target backup nodes for executing the plurality of new backup tasks according to the similarity and the performance information of the backup nodes comprises:
for any backup task among the plurality of new backup tasks, determining, from the plurality of backup nodes, a target backup node for executing that backup task according to the similarity between that backup task and the plurality of backup nodes and the performance information of the backup nodes; and
taking the backup tasks as a new plurality of backup tasks and the backup nodes as a new plurality of backup nodes, and returning to the step of determining, for any backup task among the plurality of new backup tasks, a target backup node for executing that backup task from the plurality of backup nodes according to the similarity between that backup task and the plurality of backup nodes and the performance information of the backup nodes.
5. The method of claim 1, wherein sequentially determining, from the plurality of backup nodes, the target backup nodes for executing the plurality of new backup tasks according to the similarity and the performance information of the backup nodes comprises:
acquiring a storage cost of a backup node and an efficiency cost of the backup node, wherein the storage cost is determined according to a deduplication savings amount of the backup node, and the deduplication savings amount comprises the data amount that the backup node can save after deleting data that is duplicated between the data to be backed up and the cache data;
determining a transmission cost of the backup node according to the similarity and a data movement cost of the backup node;
determining efficiency parameters of the plurality of backup nodes according to the storage cost, the efficiency cost and the transmission cost; and
sequentially determining, from the plurality of backup nodes, the target backup nodes for executing the new backup tasks according to the efficiency parameters of the backup nodes.
6. The method according to claim 1, wherein before merging the data to be backed up of the backup tasks that meet the merging condition according to the feature distances between the fingerprint feature vectors corresponding to the backup tasks, the method further comprises:
determining a plurality of data blocks to be backed up from the data to be backed up in the backup task;
determining a plurality of hash values corresponding to any data block to be backed up;
taking the minimum hash value among the hash values as the feature value of that data block to be backed up; and
constructing the fingerprint feature vector corresponding to the backup task according to the feature values of the data blocks to be backed up.
7. The method of claim 1, wherein before determining the similarity between the plurality of new backup tasks and the plurality of backup nodes according to the fingerprint feature vectors corresponding to the new backup tasks and the fingerprint feature vectors of the cache data in the backup nodes, the method further comprises:
determining the cache data according to hot data in the backup node, wherein the hot data comprises data in the backup node whose access frequency is higher than a frequency threshold;
determining a plurality of node data blocks from the cache data; and
determining the fingerprint feature vector of the cache data according to the fingerprints of the plurality of node data blocks.
8. A device for determining backup nodes in a global deduplication storage scene, the device comprising:
a task module, used for determining a plurality of backup tasks, wherein the backup tasks comprise data to be backed up, and each backup task corresponds to a fingerprint feature vector determined according to the data to be backed up;
a merging module, used for merging the data to be backed up of those backup tasks that meet a merging condition, according to the feature distances between the fingerprint feature vectors corresponding to the backup tasks, to generate a plurality of new backup tasks;
a determining module, used for determining the similarity between the plurality of new backup tasks and a plurality of backup nodes according to the fingerprint feature vectors corresponding to the new backup tasks and the fingerprint feature vectors of cache data in the backup nodes; and
a backup module, used for sequentially determining, from the plurality of backup nodes, target backup nodes for executing the new backup tasks according to the similarity and performance information of the backup nodes, wherein the target backup nodes are used for storing the data to be backed up of the new backup tasks and for deleting data that is duplicated between the data to be backed up and the cache data.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.
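
The merging loop recited in claim 2 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the representation of a task as a (fingerprint set, size) pair, and the use of Jaccard distance as the feature distance are all assumptions made for illustration, since the claims do not fix a distance metric. The sketch also assumes that a closest pair exceeding the preset data amount is skipped in favor of the next closest pair, and that merging stops when no pair fits; the claims do not state what happens in that case.

```python
from itertools import combinations


def jaccard_distance(a: frozenset, b: frozenset) -> float:
    """Assumed feature distance: 1 - |a & b| / |a | b| over feature values."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def merge_tasks(tasks, preset_data_amount, preset_class_count):
    """Greedily merge the closest pair of tasks.

    tasks: list of (fingerprints: frozenset, size: int) pairs (hypothetical layout).
    Merging repeats until the number of tasks is smaller than
    preset_class_count, or until no pair fits under preset_data_amount.
    """
    tasks = list(tasks)
    while len(tasks) >= preset_class_count:
        best = None  # (distance, i, j) of the closest pair that fits
        for i, j in combinations(range(len(tasks)), 2):
            # Merging condition: combined size must stay below the preset amount.
            if tasks[i][1] + tasks[j][1] >= preset_data_amount:
                continue
            d = jaccard_distance(tasks[i][0], tasks[j][0])
            if best is None or d < best[0]:
                best = (d, i, j)
        if best is None:
            break  # assumption: stop when no pair satisfies the condition
        _, i, j = best
        merged = (tasks[i][0] | tasks[j][0], tasks[i][1] + tasks[j][1])
        tasks = [t for k, t in enumerate(tasks) if k not in (i, j)] + [merged]
    return tasks
```

For example, four tasks of size 10 whose fingerprints overlap pairwise, merged with a preset data amount of 25 and a preset classification number of 3, collapse into two merged tasks of size 20 each.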

Description

Method and device for determining backup nodes in global deduplication storage scene

Technical Field

The present application relates to the field of big data technologies, and in particular, to a method, an apparatus, a computer device, a computer readable storage medium, and a computer program product for determining a backup node in a global deduplication storage scenario.

Background

With the dramatic increase in data volume, an enterprise may have tens or even hundreds of backup nodes for the data to be backed up, which makes managing backup nodes a great challenge. Data deduplication (deduplication for short) is a widely adopted technique: it replaces the physical storage of redundant data with logical references, achieving single-instance storage of data, which significantly saves storage space and reduces user cost. Distributed global deduplication is a technique that performs data deduplication across multiple storage locations (backup nodes).

In the conventional technology, a user is usually required to select a suitable backup node from a plurality of backup nodes to perform data backup. However, having the user manually select the backup node for the backup data leads to low efficiency in determining backup nodes in a global deduplication storage scenario.

Disclosure of Invention

In view of the foregoing, it is desirable to provide a method, an apparatus, a computer device, a computer readable storage medium, and a computer program product for determining a backup node in a global deduplication storage scenario, which can improve the efficiency of determining backup nodes in the global deduplication storage scenario.
In a first aspect, the present application provides a method for determining a backup node in a global deduplication storage scenario, including:
determining a plurality of backup tasks, wherein the backup tasks comprise data to be backed up, and each backup task corresponds to a fingerprint feature vector determined according to the data to be backed up;
merging the data to be backed up of those backup tasks that meet a merging condition, according to the feature distances between the fingerprint feature vectors corresponding to the backup tasks, to generate a plurality of new backup tasks;
determining the similarity between the plurality of new backup tasks and a plurality of backup nodes according to the fingerprint feature vectors corresponding to the new backup tasks and the fingerprint feature vectors of cache data in the backup nodes; and
sequentially determining, from the plurality of backup nodes, target backup nodes for executing the backup tasks according to the similarity and performance information of the backup nodes, wherein the target backup nodes are used for storing the data to be backed up of the backup tasks and for deleting data that is duplicated between the data to be backed up and the cache data.
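
As a rough illustration of the fingerprint, similarity and node-selection steps above, the following Python sketch may help. It is only a sketch under stated assumptions, not the method of the application: fixed-size 4096-byte blocks, salted SHA-256 as the family of hash values per block, Jaccard similarity between feature sets, and a score of similarity minus a normalized load figure are all hypothetical choices, as are every function and parameter name.

```python
import hashlib


def block_feature(block: bytes, num_hashes: int = 4) -> int:
    """Compute several salted hashes of one block and keep the minimum."""
    values = []
    for seed in range(num_hashes):
        digest = hashlib.sha256(bytes([seed]) + block).digest()
        values.append(int.from_bytes(digest[:8], "big"))
    return min(values)


def fingerprint_vector(data: bytes, block_size: int = 4096) -> tuple:
    """Split data into fixed-size blocks (an assumption) and collect one
    feature value per block."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    return tuple(block_feature(b) for b in blocks)


def similarity(task_fp: tuple, node_fp: tuple) -> float:
    """Assumed similarity: Jaccard index over the sets of feature values."""
    a, b = set(task_fp), set(node_fp)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assign_tasks(task_fps: dict, node_fps: dict, node_load: dict) -> dict:
    """Assign tasks one at a time to the node with the best score.

    Score = similarity - load is a hypothetical way to combine similarity
    with performance information; load values are assumed to lie in [0, 1].
    """
    load = dict(node_load)
    assignment = {}
    for task, fp in task_fps.items():
        best = max(node_fps, key=lambda n: similarity(fp, node_fps[n]) - load[n])
        assignment[task] = best
        load[best] += 0.1  # hypothetical load increase per assigned task
    return assignment
```

A task whose blocks already appear in one node's cache yields a similarity of 1.0 with that node and, with equal loads, is assigned there, which is where deduplication would save the most storage.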
In one embodiment, merging the data to be backed up of the backup tasks that meet the merging condition, according to the feature distances between the fingerprint feature vectors corresponding to the backup tasks, to generate a plurality of new backup tasks, includes: determining, according to the feature distances between the fingerprint feature vectors corresponding to the backup tasks, the two backup tasks with the minimum feature distance; and, if the total data amount of the two backup tasks after merging is smaller than the preset data amount, merging the two backup tasks to obtain a plurality of new backup tasks, and returning to the step of determining the two backup tasks with the minimum feature distance, until the number of new backup tasks is smaller than the preset classification number.

In one embodiment, the preset data amount is determined according to the data amounts of the backup tasks and the preset classification number, and the preset classification number is determined according to the performance information of the backup nodes and the number of nodes.

In one embodiment, sequentially determining, from the plurality of backup nodes, the target backup node for executing the backup task according to the similarity and the performance information of the backup nodes includes: for any backup task among the plurality of new backup tasks, determining, from the plurality of backup nodes, a target backup node for executing that backup task according to the similarity between that backup task and the plurality of backup nodes and the performance information of the backup nodes; and a step of determining efficiency parameters of the plurality of backup nodes according to the similarity between the arbitrary backup task and the plurality of backup nodes and the performance information of