CN-122018804-A - Cloud computing-oriented big data asset storage and retrieval method and system
Abstract
The invention relates to the technical field of cloud computing data storage and retrieval, and discloses a cloud computing-oriented big data asset storage and retrieval method and system. The method can effectively capture the instantaneous burst usage patterns formed when microservice units and automation tasks access specific data content at high intensity within a very short time, and solves the prior-art problem whereby a long information collection period and a wide statistical range cause transient access peaks to be misjudged.
Inventors
- LI GAOSONG
- LI YONGBO
- LI DINGWEI
Assignees
- 深圳市鼎皓达科技有限公司 (Shenzhen Dinghaoda Technology Co., Ltd.)
Dates
- Publication Date
- 20260512
- Application Date
- 20260130
Claims (10)
- 1. A cloud computing-oriented big data asset storage and retrieval method, characterized by comprising the following steps: acquiring access event information of a data block, wherein the access event information comprises an identifier, an access type and an access time of the data block; evaluating, according to the access event information, the instantaneous access intensity of the data block within a preset short time window, and generating an access peak indication when the instantaneous access intensity reaches a preset intensity threshold; in response to the access peak indication, starting an evaluation process that samples access events over a short period and evaluates the activity level within a preset statistical window; when the result of the evaluation process continuously satisfies a preset activity condition, marking the data block as being in a peak-active state, and setting a retention period for the data block in the peak-active state; during the retention period, preventing the data block in the peak-active state from being demoted from the high-performance storage tier to the archive storage tier; and after the retention period ends, if the data block does not trigger the access peak indication again, re-evaluating the storage tier of the data block according to its long-term average access activity.
- 2. The cloud computing-oriented big data asset storage and retrieval method of claim 1, wherein suspending long-period average access frequency statistics for the data block and starting, in response to the access peak indication, an evaluation process that samples access events over a short period and evaluates the activity level within a preset statistical window comprises: configuring a data block access counter inside each storage node of the high-performance storage tier, the counter recording the number of accesses to the data block; when the number of accesses recorded by the counter reaches a preset instantaneous peak threshold, updating the local metadata of the data block and marking the data block as being in a peak-active state; triggering a state-update callback mechanism to notify a local hot/cold judging module, which gives the peak-active state of the data block the highest priority and adjusts the storage tier management strategy for the data block; setting a retention period for the data block in the peak-active state, the retention period ensuring that the data block remains in the high-performance storage tier; and re-evaluating the activity level of the data block after the retention period ends.
- 3. The cloud computing-oriented big data asset storage and retrieval method of claim 1, wherein preventing the data block in the peak-active state from being demoted from the high-performance storage tier to the archive storage tier during the retention period comprises: dividing the high-performance storage tier into a plurality of virtual storage partitions and binding one data set or a group of data sets to each virtual storage partition; predicting the capacity requirements of the data sets in each virtual storage partition and reserving physical capacity for each partition; placing the data block in the peak-active state within the physical resource range of its bound virtual storage partition; and isolating the resources of each virtual storage partition, and starting a cross-partition resource scheduling and priority arbitration mechanism when a partition is in a predicted peak period, so as to prevent the data block in the peak-active state from being demoted from the high-performance storage tier to the archive storage tier.
- 4. The cloud computing-oriented big data asset storage and retrieval method of claim 1, wherein preventing the data block in the peak-active state from being demoted from the high-performance storage tier to the archive storage tier during the retention period comprises: identifying peak-active data blocks located within a compromised storage area; determining an emergency protection strategy according to the service priority, data sensitivity and fault type of the peak-active data blocks; executing emergency data processing tasks that copy or migrate the affected peak-active data blocks to other high-performance storage areas, or create encrypted backups; and updating the metadata and data access index of the peak-active data blocks and reactivating their peak-active retention period.
- 5. The cloud computing-oriented big data asset storage and retrieval method of claim 1, wherein preventing the data block in the peak-active state from being demoted from the high-performance storage tier to the archive storage tier during the retention period comprises: continuously monitoring the access event information of the data block, the access event information comprising an identifier, an access type and an access time of the data block; calculating the activity trend of the data block from the access event information; identifying how the activity trend matches the current retention period setting; dynamically adjusting the retention period according to the activity trend and a preset adjustment rule to obtain an adjusted retention period; updating the metadata of the data block to include the adjusted retention period; and preventing the data block in the peak-active state from being demoted from the high-performance storage tier to the archive storage tier during the adjusted retention period.
- 6. The cloud computing-oriented big data asset storage and retrieval method of claim 1, wherein preventing the data block in the peak-active state from being demoted from the high-performance storage tier to the archive storage tier during the retention period comprises: deploying an independent storage tier management instruction channel, configuring dedicated network bandwidth for the channel, and adopting a high-priority data transmission protocol; when the system generates an instruction that prevents the data block from being demoted from the high-performance storage tier to the archive storage tier, transmitting the instruction through the independent channel and verifying it in real time to obtain a verification result; according to the verification result, configuring an instruction execution confirmation mechanism at the receiving end of the channel, which returns a confirmation signal to the transmitting end after successfully receiving and processing the instruction; and triggering an instruction retransmission mechanism to retransmit the instruction if the confirmation signal is not received within a preset time.
- 7. The cloud computing-oriented big data asset storage and retrieval method of claim 1, wherein preventing the data block in the peak-active state from being demoted from the high-performance storage tier to the archive storage tier during the retention period comprises: allocating an independent storage resource quota to each tenant; setting a globally highest execution priority for instructions that prevent the data block from being demoted from the high-performance storage tier to the archive storage tier; when the system receives such an instruction, routing it to a dedicated instruction processing queue that is independent of the regular data operation queues of other tenants and has priority scheduling rights; deploying a policy arbitration module on each storage node to monitor, in real time, resource usage and instruction execution state across tenants; when the policy arbitration module detects that such an instruction conflicts with a data migration or resource allocation operation of a high-priority tenant, forcibly interrupting or delaying the operations of the other tenants; and providing a dedicated communication channel for the instruction, the channel being configured with independent network bandwidth and adopting a protocol that guarantees reliable instruction transmission.
- 8. The cloud computing-oriented big data asset storage and retrieval method of claim 1, wherein preventing the data block in the peak-active state from being demoted from the high-performance storage tier to the archive storage tier during the retention period comprises: continuously acquiring the running state information of each storage medium in the high-performance storage tier, the running state information comprising the read/write error rate, bad block count, remaining lifetime and temperature of the storage medium; evaluating the health state of the storage medium from the running state information, and generating a medium abnormality warning when the health state falls below a preset health threshold; in response to the medium abnormality warning, identifying the peak-active data blocks, determining a data protection level according to their service priority and data sensitivity, and determining an emergency handling operation according to the type of the warning; executing the emergency handling operation by copying or migrating the affected peak-active data blocks to other healthy storage media, or backing up the data; and updating the metadata and data access index of the peak-active data blocks and reactivating their retention period.
- 9. The cloud computing-oriented big data asset storage and retrieval method of claim 1, wherein preventing the data block in the peak-active state from being demoted from the high-performance storage tier to the archive storage tier during the retention period comprises: deploying a data consistency coordinator that configures a cross-cloud data transmission channel, the channel adopting a data slicing transmission mode in which the data block is sliced and each slice is independently checksummed and encrypted; deploying a data proxy module in the storage environment of each cloud service provider, the data proxy module identifying and adapting to the storage interfaces and data processing logic of the different cloud service providers; acquiring, through the data proxy module, the metadata and data content of the peak-active data blocks from the storage environments of the different cloud service providers, the metadata comprising checksums, version information and timestamps of the data blocks; performing a cross-cloud data consistency check on the metadata and data content of the peak-active data blocks, the check comparing data block checksums, version information and timestamps across the different cloud service providers; when the check finds an inconsistency, determining a consistent data version according to a preset conflict resolution strategy; synchronizing, through the data proxy module, the consistent data version into the storage environment of the inconsistent cloud service provider and updating the corresponding metadata; and having the data consistency coordinator continuously monitor the transmission state and data integrity of the cross-cloud channel during data synchronization, triggering a data retransmission operation when transmission is abnormal or data is damaged.
- 10. A cloud computing-oriented big data asset storage and retrieval system, comprising: an acquisition end for acquiring access event information of a data block, the access event information comprising an identifier, an access type and an access time of the data block, wherein the acquisition end evaluates, according to the access event information, the instantaneous access intensity of the data block within a preset short time window and generates an access peak indication when the instantaneous access intensity reaches a preset intensity threshold; and a response end for, in response to the access peak indication, starting an evaluation process that samples access events over a short period and evaluates the activity level within a preset statistical window, marking the data block as being in a peak-active state when the result of the evaluation process continuously satisfies a preset activity condition, and setting a retention period for the data block in the peak-active state; and, after the retention period ends, if the data block does not trigger the access peak indication again, re-evaluating the storage tier of the data block according to its long-term average access activity.
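The short-window peak detection and retention mechanism recited in claims 1 and 2 can be illustrated with a minimal single-node sketch. All names, thresholds and window sizes below are illustrative assumptions, not values from the patent:

```python
from collections import defaultdict, deque

# Illustrative parameters (hypothetical values, not specified in the claims).
SHORT_WINDOW_S = 60          # preset short time window, in seconds
INTENSITY_THRESHOLD = 1000   # preset intensity threshold (accesses per window)
RETENTION_PERIOD_S = 3600    # retention period for peak-active blocks

class BlockActivityTracker:
    """Tracks per-block access events and flags transient peaks,
    roughly following claims 1-2 (simplified, single storage node)."""

    def __init__(self):
        self.events = defaultdict(deque)   # block_id -> recent access timestamps
        self.peak_active_until = {}        # block_id -> retention deadline

    def record_access(self, block_id, access_type, ts):
        """Acquire an access event (identifier, access type, access time) and
        evaluate instantaneous intensity within the preset short window."""
        q = self.events[block_id]
        q.append(ts)
        # Drop events that fall outside the short time window.
        while q and q[0] < ts - SHORT_WINDOW_S:
            q.popleft()
        if len(q) >= INTENSITY_THRESHOLD:
            # Access peak indication: mark the block peak-active
            # and set (or refresh) its retention period.
            self.peak_active_until[block_id] = ts + RETENTION_PERIOD_S

    def may_demote(self, block_id, ts):
        """During the retention period, the block must stay in the
        high-performance tier; demotion to the archive tier is blocked."""
        return ts >= self.peak_active_until.get(block_id, 0)
```

The `deque`-based sliding window counts only events inside the preset short window, which is what distinguishes this scheme from a long-period average; the per-block deadline plays the role of the claims' retention period.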
Description
Cloud computing-oriented big data asset storage and retrieval method and system
Technical Field
The invention relates to the technical field of cloud computing data storage and retrieval, and in particular to a cloud computing-oriented big data asset storage and retrieval method and system.
Background
In modern cloud computing environments, a tiered data storage policy is typically employed to manage massive data volumes while balancing storage cost against access speed. This strategy relies on a determination of how actively data is used: frequently used (hot) data is kept on high-performance storage devices, while infrequently used (cold) data is migrated to lower-cost archive storage. However, as the technical architecture of cloud service platforms has continued to evolve, and in particular with the wide adoption of microservice units and automation tasks, the conventional data activity determination mechanism faces a serious challenge. These new applications often access specific data content at high intensity within a very short time, forming instantaneous burst usage patterns; because of its inherent statistical approach, the existing mechanism struggles to capture such dynamic changes accurately, leading to misjudgment of the data's true activity level and, in turn, to a series of performance and management problems. In a large cloud service platform, one of the core functions is to provide storage and retrieval services for unstructured big data content to enterprise clients. The platform was originally designed to serve a few large enterprise-level applications with relatively stable business models.
The data of these applications is used in a persistent, stable pattern: for example, a conventional Enterprise Resource Planning (ERP) system performs continuous read and write operations on historical transaction records throughout the day, and the frequency of data usage varies little over long periods. To optimize storage cost while ensuring data access performance, the platform built a tiered data storage mechanism in which judging whether data is frequently used is a key part. The key component of the platform's hot/cold data judging system is a usage recorder. The recorder was initially configured with a long information collection period and a wide statistical range: for example, it might count the usage events of a data block once every few hours and calculate the average number of uses over a one-day unit. This design was reasonable at the time, because it effectively smoothed out small fluctuations in those sustained, stable data usage patterns, thereby providing a stable and accurate "activity" assessment for data that was, or was not, used over long periods. However, as the technical architecture of the cloud service platform continued to evolve and customer demands on its services became more complex, the application ecosystem inside the platform changed fundamentally. What were originally a few large applications were gradually split into hundreds or even thousands of individual microservice units. At the same time, the platform introduced a large number of automated data analysis workflows and on-demand business report generation jobs. These microservice units and automation tasks operate in a manner quite different from traditional applications.
Their activity tends to occur at specific sets of time points: for example, on the hour, at the start of each day, or at the beginning of each week, a large number of microservice units may be triggered simultaneously, performing many concurrent read and write operations on specific historical unstructured data sets to generate instant analysis reports or update business metrics. These usage behaviors are characterized by extremely short duration, possibly only a few minutes, but very high intensity and concurrency, creating a "peak usage" pattern. Because the platform still uses the originally designed usage recorder, with its long information collection period and wide statistical range, it shows a significant limitation in the face of these new short-duration, high-intensity usage peaks. When a data set undergoes tens of thousands of uses in a very short period (e.g., 5 minutes), but the statistical range of the recorder is still 24 hours, these high-intensity usage events are averaged over the 24-hour period. The resulting average usage count is far lower than the actual instantaneous usage intensity. As a result, data that is in fact used periodically at high frequency is erroneously judged by the system to be infrequently used data. This misjudgment of the data's real usage pattern causes the system's view of data "activity" to diverge markedly from actual usage. Depending on the data lifecycle management rules that the platform has
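The dilution effect described above can be made concrete with hypothetical numbers; a burst of 30,000 accesses over 5 minutes is assumed purely for illustration:

```python
# Illustrative scenario: a burst of 30,000 accesses in a 5-minute window,
# then silence for the rest of the day (numbers are hypothetical).
burst_accesses = 30_000
burst_minutes = 5
day_minutes = 24 * 60  # 1440

instantaneous_rate = burst_accesses / burst_minutes  # accesses/min during the burst
daily_average_rate = burst_accesses / day_minutes    # accesses/min averaged over 24 h

# The long-window recorder sees only the daily average, which is lower than
# the true burst intensity by the ratio of the two window lengths, so the
# data set looks "cold" to the tiering policy.
dilution_factor = instantaneous_rate / daily_average_rate  # = day_minutes / burst_minutes
```

With these numbers the burst runs at 6,000 accesses/min, while the 24-hour average is only about 20.8 accesses/min, a 288x dilution, which is exactly the misjudgment the claimed short-window mechanism is designed to avoid.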