CN-120687304-B - Large-scale data distributed storage and processing method and system


Abstract

The invention discloses a large-scale data distributed storage and processing method and system, belonging to the technical field of data storage. The method comprises: storing the full data to be accessed in a master node; dividing the full data into a plurality of data units and predicting a first access frequency of each data unit according to a time sequence model; and allocating at least one data unit to replica nodes according to the first access frequency, wherein each replica node comprises at least one data combination and each data combination comprises at least two data units, so that part of the access requests in the next access period are transferred to at least some of the replica nodes.

Inventors

ZHU YING
XIAO ZETONG
MA MENGMENG
LIU GUANGMING
LU CHEN
HUANG YINGYING

Assignees

上海燊海益众数字科技有限公司

Dates

Publication Date: 2026-05-08
Application Date: 2025-06-11

Claims (6)

1. A large-scale data distributed storage and processing method, comprising:
storing the full data to be accessed in a master node;
dividing the full data into a plurality of data units, and predicting a first access frequency of each data unit according to a time sequence model, wherein the first access frequency is the access frequency in the next access period; and
allocating at least one data unit to replica nodes according to the first access frequency, wherein each replica node comprises at least one data combination and each data combination comprises at least two data units, so that part of the access requests in the next access period are transferred to at least some of the replica nodes;
wherein said allocating at least one data unit to a replica node according to the first access frequency comprises:
acquiring a plurality of adjacent replica nodes and a second access frequency of each data unit, wherein the second access frequency is the access frequency in the immediately preceding access period, and the plurality of adjacent replica nodes are the replica nodes corresponding to the immediately preceding access period; and
allocating at least one data unit to a replica node according to the second access frequency and the first access frequency;
wherein said allocating at least one data unit to a replica node according to the second access frequency and the first access frequency comprises:
determining a first reconstructed data unit and a second reconstructed data unit among the plurality of data units according to the difference between the second access frequency and the first access frequency, wherein the first reconstructed data unit is a data unit whose second access frequency is greater than its first access frequency, and the second reconstructed data unit is a data unit whose second access frequency is less than its first access frequency; and
allocating the second reconstructed data unit to a replica node according to the first reconstructed data unit;
wherein said allocating the second reconstructed data unit to a replica node according to the first reconstructed data unit comprises:
determining the plurality of replica nodes where the first reconstructed data unit is currently located, and determining at least one replacement replica node from among those replica nodes; and
replacing the first reconstructed data unit in the replacement replica node with the second reconstructed data unit;
wherein said allocating the second reconstructed data unit to a replica node according to the first reconstructed data unit further comprises:
determining the plurality of replica nodes where the first reconstructed data unit is currently located, and deleting at least one first reconstructed data unit from those replica nodes; and
allocating the second reconstructed data units to replica nodes according to the first access frequency of the second reconstructed data units, specifically by dividing the plurality of second reconstructed data units into a plurality of data combinations according to the first access frequency, and assigning corresponding replica nodes to each data combination according to the number of nodes required.
2. The method according to claim 1, wherein the first access frequency of each data unit in each data combination is within a preset standard, the number of times each data unit is stored in the plurality of replica nodes is determined according to the first access frequency, and the data combinations corresponding to the respective replica nodes are the same or different.
3. The method according to claim 1, wherein transferring part of the access requests in the next access period to at least some of the replica nodes comprises: determining a current data unit according to the access request, and transferring part of the access requests to at least some of the replica nodes according to the number of replica nodes where the current data unit is located and a preset value.
4. The method according to claim 1, wherein predicting the first access frequency of each data unit according to the time sequence model comprises: acquiring a historical access record of each data unit, wherein the historical access record comprises the access times and the access counts of each data unit; and outputting the first access frequency of each data unit according to the historical access record and the time sequence model.
5. The method according to claim 1, wherein each data unit is stored in at least two replica nodes, and each replica node corresponds to at least one master node.
6. A large-scale data distributed storage and processing system for implementing the large-scale data distributed storage and processing method according to any one of claims 1-5, comprising: a storage module, configured to store the full data to be accessed in the master node; a prediction module, configured to divide the full data into a plurality of data units and predict a first access frequency of each data unit according to a time sequence model, wherein the first access frequency is the access frequency in the next access period; and an access module, configured to allocate at least one data unit to replica nodes according to the first access frequency, wherein each replica node comprises at least one data combination and each data combination comprises at least two data units, so that part of the access requests in the next access period are transferred to at least some of the replica nodes.
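
The reallocation recited in claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the function names `classify_units` and `replace_in_replicas`, the dictionary layout, and the rule for picking the replacement replica node (the first holder, in sorted order, that does not already store the incoming unit) are all hypothetical choices that the claims do not specify.

```python
from typing import Dict, List, Set, Tuple


def classify_units(
    first_freq: Dict[str, float],   # predicted frequency for the next access period
    second_freq: Dict[str, float],  # observed frequency in the preceding access period
) -> Tuple[List[str], List[str]]:
    """Return (first reconstructed units, second reconstructed units).

    First reconstructed: second frequency > first frequency (demand is falling).
    Second reconstructed: second frequency < first frequency (demand is rising).
    """
    falling = sorted(u for u in first_freq if second_freq.get(u, 0.0) > first_freq[u])
    rising = sorted(u for u in first_freq if second_freq.get(u, 0.0) < first_freq[u])
    return falling, rising


def replace_in_replicas(
    replicas: Dict[str, Set[str]],  # replica node id -> data units it currently holds
    falling: List[str],
    rising: List[str],
) -> Dict[str, Set[str]]:
    """Swap one stored copy of each falling unit for a rising unit, where possible."""
    updated = {node: set(units) for node, units in replicas.items()}
    pending = list(rising)
    for old_unit in falling:
        if not pending:
            break
        new_unit = pending[0]
        # Replica nodes where the first reconstructed unit is currently located.
        holders = sorted(n for n, units in updated.items() if old_unit in units)
        # Assumed selection rule: first holder that does not already store new_unit.
        target = next((n for n in holders if new_unit not in updated[n]), None)
        if target is None:
            continue
        updated[target].remove(old_unit)
        updated[target].add(new_unit)
        pending.pop(0)
    return updated
```

For example, with predicted frequencies `{"a": 10, "b": 50, "c": 30}` and previous-period frequencies `{"a": 40, "b": 20, "c": 30}`, unit `a` is a first reconstructed data unit and `b` is a second reconstructed data unit, so one stored copy of `a` is replaced by `b` while the other copy of `a` stays in place.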

Description

Large-scale data distributed storage and processing method and system

Technical Field

The invention relates to the technical field of data storage, and in particular to a large-scale data distributed storage and processing method and system.

Background

With the rapid development of big data and cloud computing, internet application scenarios are increasingly rich, and scenarios such as e-commerce promotions, trending bursts on social platforms, and large-scale online live broadcasts generate large amounts of frequently accessed data. To keep data services continuously stable, distributed data management architectures are widely used. However, current distributed data management approaches have significant drawbacks. When access peaks occur, most systems still direct accesses to the same data location, while the other data backups serve only as emergency fallbacks. As a result, current approaches cannot address high-concurrency access and idle replica resources at the same time, which is a shortcoming of the prior art.

Disclosure of Invention

In view of the shortcomings of the prior art, the invention aims to provide a large-scale data distributed storage and processing method and system. Through a new replica structure, each replica node stores only part of the data; after the current data unit is obtained, the replica node or the master node is accessed according to the storage location of the current data unit, which improves the utilization of the replica nodes while reducing their resource burden.
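
The access decision described above, choosing between a replica node and the master node according to where the current data unit is stored, can be sketched as follows. This is only an illustration under assumptions: the text does not define how the preset value is applied, so here `preset_share` is a hypothetical per-replica fraction of requests that is diverted away from the master node, and the names `routing_weights`, `pick_node`, and `"master"` are likewise made up for the example.

```python
import random
from typing import Dict, Optional, Set


def routing_weights(
    unit: str,
    placement: Dict[str, Set[str]],  # replica node id -> data units stored there
    preset_share: float = 0.25,      # assumed: share of requests diverted per replica
    master: str = "master",
) -> Dict[str, float]:
    """Return the fraction of requests for `unit` that each node should serve."""
    holders = sorted(node for node, units in placement.items() if unit in units)
    if not holders:
        # The unit is stored only on the master node, which holds the full data.
        return {master: 1.0}
    diverted = min(1.0, len(holders) * preset_share)
    weights = {node: diverted / len(holders) for node in holders}
    if diverted < 1.0:
        weights[master] = 1.0 - diverted
    return weights


def pick_node(
    unit: str,
    placement: Dict[str, Set[str]],
    rng: Optional[random.Random] = None,
    **kwargs,
) -> str:
    """Pick one node for a single request according to the routing weights."""
    weights = routing_weights(unit, placement, **kwargs)
    rng = rng or random.Random()
    nodes = list(weights)
    return rng.choices(nodes, weights=[weights[n] for n in nodes], k=1)[0]
```

With the placement `{"r1": {"u1", "u2"}, "r2": {"u1", "u3"}}` and the default `preset_share`, requests for `u1` are split a quarter to each replica node and half to the master node, while a unit held by no replica node is always served by the master node.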
To achieve the above purpose, the invention provides the following technical solutions:

The invention provides a large-scale data distributed storage and processing method, which comprises: storing the full data to be accessed in a master node; dividing the full data into a plurality of data units, and predicting a first access frequency of each data unit according to a time sequence model, wherein the first access frequency is the access frequency in the next access period; and allocating at least one data unit to replica nodes according to the first access frequency, wherein each replica node comprises at least one data combination and each data combination comprises at least two data units, so that part of the access requests in the next access period are transferred to at least some of the replica nodes.

As a further improvement of the invention, the first access frequency of each data unit in each data combination is within a preset standard, the number of times each data unit is stored in the plurality of replica nodes is determined according to the first access frequency, and the data combinations corresponding to the respective replica nodes are the same or different.

As a further improvement of the invention, said allocating at least one data unit to a replica node according to the first access frequency comprises: acquiring a plurality of adjacent replica nodes and a second access frequency of each data unit, wherein the second access frequency is the access frequency in the immediately preceding access period, and the plurality of adjacent replica nodes are the replica nodes corresponding to the immediately preceding access period; and allocating at least one data unit to a replica node according to the second access frequency and the first access frequency.
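
The prediction and grouping steps above can be sketched as follows. This is a rough illustration under assumptions, not the method as specified: the text does not name a particular time sequence model, so simple exponential smoothing is swapped in as a stand-in, and the grouping rule (sort by predicted frequency, then chunk neighbours together) and the parameters `alpha`, `combo_size`, and `node_capacity` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def predict_next_frequency(history: Sequence[float], alpha: float = 0.5) -> float:
    """Stand-in time sequence model: simple exponential smoothing of past counts."""
    if not history:
        return 0.0
    level = float(history[0])
    for observed in history[1:]:
        level = alpha * observed + (1.0 - alpha) * level
    return level


def build_combinations(
    first_freq: Dict[str, float],  # data unit -> predicted first access frequency
    combo_size: int = 2,           # target units per data combination (at least 2)
    node_capacity: float = 100.0,  # assumed requests one replica node can absorb
) -> List[Tuple[List[str], int]]:
    """Group units with similar predicted frequency and size their replica count."""
    if combo_size < 2:
        raise ValueError("each data combination needs at least two data units")
    # Sort so that neighbouring units have similar predicted frequencies.
    ordered = sorted(first_freq, key=lambda u: (-first_freq[u], u))
    chunks = [ordered[i:i + combo_size] for i in range(0, len(ordered), combo_size)]
    # Fold a trailing single unit into the previous combination.
    if len(chunks) > 1 and len(chunks[-1]) < 2:
        tail = chunks.pop()
        chunks[-1].extend(tail)
    plan = []
    for units in chunks:
        load = sum(first_freq[u] for u in units)
        nodes_required = max(1, math.ceil(load / node_capacity))
        plan.append((units, nodes_required))
    return plan
```

In this sketch a combination with a higher total predicted frequency is assigned more replica nodes, which mirrors the idea that the number of stored copies follows the first access frequency.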
As a further improvement of the invention, said allocating at least one data unit to a replica node according to the second access frequency and the first access frequency comprises: determining a first reconstructed data unit and a second reconstructed data unit among the plurality of data units according to the difference between the second access frequency and the first access frequency, wherein the first reconstructed data unit is a data unit whose second access frequency is greater than its first access frequency, and the second reconstructed data unit is a data unit whose second access frequency is less than its first access frequency; and allocating the second reconstructed data unit to a replica node according to the first reconstructed data unit.

As a further improvement of the invention, said allocating the second reconstructed data unit to a replica node according to the first reconstructed data unit comprises: determining the plurality of replica nodes where the first reconstructed data unit is currently located, and determining at least one replacement replica node from among those replica nodes; and replacing the first reconstructed data unit in the replacement replica node with the second reconstructed data unit.

As a further improvement of the invention, said allocating the second reconstructed data unit to a replica node according to the first reconstructed data unit further comp