CN-121350115-B - Big data mining method and system applied to cloud storage service
Abstract
The invention provides a big data mining method and system applied to cloud storage service, relating to the field of cloud storage services. The method first constructs an association mapping relation between a user operation sequence and a storage node response sequence, thereby establishing the correspondence between user operations and storage node responses. Based on the association mapping relation, it generates a mining analysis benchmark containing normal association features and conventional flow features, performs link feature association mining on the benchmark, and identifies abnormal association modes to obtain an abnormal association mode set. It then locates the problem links of data transmission according to the abnormal association mode set, generates cloud storage service optimization instructions based on the problem links, and sends the instructions to a cloud storage management system to adjust data transmission strategies and storage node configuration, thereby improving the performance and stability of the cloud storage service.
Inventors
- CHEN PING
Assignees
- 四川龙章凤彩网络科技有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20251021
Claims (10)
- 1. A big data mining method applied to cloud storage service, characterized by comprising the following steps: constructing an association mapping relation between a user operation sequence and a storage node response sequence, wherein the user operation sequence comprises continuous operation records executed by a user on a cloud storage platform, the storage node response sequence comprises feedback records of each storage node in a cloud storage system on user operations, and the correspondence between user operations and storage node responses is established through the association mapping relation; generating a mining analysis benchmark of the data transmission process based on the association mapping relation, wherein the mining analysis benchmark comprises normal association features of user operations and storage node responses and conventional flow features of data transmission; performing link feature association mining on the mining analysis benchmark, and identifying abnormal association modes that deviate from the normal association features and the conventional flow features, to obtain an abnormal association mode set of the data transmission process; locating problem links of data transmission in the cloud storage service according to the abnormal association mode set, wherein the problem links comprise abnormal links of a data transmission path and response delay links of storage nodes; and generating a cloud storage service optimization instruction based on the problem links, and sending the cloud storage service optimization instruction to a cloud storage management system to adjust a data transmission strategy and storage node configuration.
- 2. The big data mining method applied to cloud storage service according to claim 1, wherein constructing the association mapping relation between the user operation sequence and the storage node response sequence comprises: extracting operation attributes of each user operation in the user operation sequence to form an operation attribute set, wherein the operation attributes comprise an operation type, an operation initiation time and an operation-related data quantity; extracting response attributes of each storage node response in the storage node response sequence to form a response attribute set, wherein the response attributes comprise a response type, a response initiation time and a response data quantity; analyzing the time association between the operation initiation times in the operation attribute set and the response initiation times in the response attribute set, and determining, as preliminary association pairs, the pairs of user operation and storage node response whose time difference is within a preset range; analyzing the matching relation between the operation type and the response type in each preliminary association pair, and retaining the preliminary association pairs whose operation type matches the response type as candidate association pairs; analyzing the correspondence between the operation-related data quantity and the response data quantity of each candidate association pair, and retaining the candidate association pairs whose operation-related data quantity matches the response data quantity as final association pairs; constructing the association mapping relation between the user operation sequence and the storage node response sequence based on the final association pairs, wherein the association mapping relation records, in table form, the identifier and attribute information of the storage node response corresponding to each user operation; carrying out association strength evaluation on each association entry in the association mapping relation, wherein the association strength evaluation calculates an association strength value based on the tightness of the time association, the accuracy of the type matching and the degree of fit between the data quantities; and sorting the association entries in the association mapping relation by association strength value, and retaining the association entries whose association strength values meet the requirement, to form the final association mapping relation between the user operation sequence and the storage node response sequence.
- 3. The big data mining method applied to cloud storage service according to claim 2, wherein the performing the association strength evaluation on each association entry in the association mapping relationship, the association strength evaluation calculating an association strength value based on the tightness degree of the time association relationship, the accuracy of type matching, and the corresponding fitness of the data amount, includes: determining time association relation weight, type matching weight and data quantity corresponding weight of association strength evaluation, wherein the sum of the time association relation weight, the type matching weight and the data quantity corresponding weight is a preset weight sum value; For each associated item, extracting operation initiation time of user operation in the associated item and response initiation time of storage node response, calculating time difference between the operation initiation time and the response initiation time, and comparing the time difference with a preset ideal time difference to obtain a time difference deviation degree; Determining the tightness degree score of the time association according to the deviation degree of the time difference, wherein the smaller the deviation degree of the time difference is, the higher the tightness degree score of the time association is, and the larger the deviation degree of the time difference is, the lower the tightness degree score of the time association is; multiplying the tightness degree score of the time association relationship with the weight of the time association relationship to obtain a time association contribution value; extracting the operation type of user operation and the response type of the response of the storage node in the associated item, analyzing the matching degree of the operation type and the response type of the response of the storage node to obtain a type matching accuracy score, and multiplying the type matching 
accuracy score by a type matching weight to obtain a type matching contribution value; Extracting the operation related data quantity of user operation and the response data quantity of storage node response in the associated item, and calculating the coincidence degree proportion of the data quantity and the response data quantity, wherein the coincidence degree proportion is the ratio of the response data quantity to the operation related data quantity; Determining a corresponding goodness of fit score of the data volume according to the goodness of fit proportion, wherein the closer the goodness of fit proportion is to a preset ideal goodness of fit proportion, the higher the corresponding goodness of fit score of the data volume is, the more the goodness of fit proportion deviates from the preset ideal goodness of fit proportion, and the lower the corresponding goodness of fit score of the data volume is; multiplying the data volume corresponding fitness score by a data volume corresponding weight to obtain a data volume corresponding contribution value; And adding the time association contribution value, the type matching contribution value and the data quantity corresponding contribution value to obtain an association strength value of the association item, carrying out normalization processing on the calculated association strength value, and recording the association strength value and each dimension contribution value of each association item to form an association strength evaluation result table.
- 4. The big data mining method applied to cloud storage service according to claim 1, wherein the generating a mining analysis reference of a data transmission process based on the association mapping relation comprises: extracting association features responded by all user operations and storage nodes from the association mapping relation to form an association feature set, wherein the association features comprise association time intervals, association type matching degree and association data volume coincidence degree; Carrying out statistical analysis on the association time intervals in the association feature set, obtaining the distribution features of the association time intervals, and determining the common range of the association time intervals as the normal association time interval range; Carrying out statistical analysis on the association type matching degree in the association feature set, obtaining the distribution feature of the association type matching degree, and determining the common range of the association type matching degree as the normal association type matching degree range; Carrying out statistical analysis on the associated data volume fitness in the associated feature set to obtain the distribution feature of the associated data volume fitness, and determining the common range of the associated data volume fitness as the normal associated data volume fitness range; combining the normal association time interval range, the normal association type matching degree range and the normal association data volume matching degree range to form normal association characteristics of user operation and storage node response; Extracting a flow record of data transmission in cloud storage service to form a data transmission flow set, wherein the flow record comprises a complete step record of data transmission from a user terminal to a storage node; Step analysis is carried out on each data transmission flow in the data transmission flow set, 
standard steps and step sequences contained in each data transmission flow are identified, and a conventional step sequence of data transmission is determined; analyzing the execution time length of each step in the conventional step sequence, and acquiring a common range of the execution time length of each step as a conventional step execution time length range; combining the conventional step sequence and the conventional step execution duration range to form conventional flow characteristics of data transmission; and integrating the normal association features with the conventional flow features to generate mining analysis benchmarks of the data transmission process, wherein the mining analysis benchmarks record specific ranges of the normal association features and specific contents of the conventional flow features in the form of documents.
- 5. The big data mining method applied to cloud storage service according to claim 4, wherein the analyzing the execution duration of each step in the regular step sequence, and obtaining the common range of the execution duration of each step as the regular step execution duration range, includes: extracting all complete data transmission flow records containing conventional step sequences from the data transmission flow set to form a conventional flow record subset; extracting, for each data transmission flow record in the subset of regular flow records, an execution start time and an execution end time of each step contained in the sequence of regular steps; Calculating the difference value between the execution ending time and the execution starting time of each step to obtain the execution duration of each step in the data transmission flow record; Summarizing the execution time length of the same step in all the data transmission flow records according to the step sequence in the conventional step sequence to form an execution time length set of each step; Carrying out data cleaning on the execution time length set of each step, carrying out statistical analysis on the cleaned execution time length set of each step, calculating the average value, the median and the standard deviation of the execution time length of each step, and determining the common range of the execution time length of each step according to the average value and the standard deviation, wherein the lower limit of the common range is the average value minus the standard deviation of a preset multiple, and the upper limit of the common range is the average value plus the standard deviation of the preset multiple; if the number of the execution duration data contained in the common range determined by the average value and the standard deviation does not meet the preset proportion requirement, adjusting the preset multiple, and recalculating the common range until the common range can cover 
the execution duration data of the preset proportion in the execution duration set of the step; Recording the determined common range of each step as the conventional step execution time length range of the step, verifying the conventional step execution time length ranges of all the steps, selecting part of data transmission flow records, checking whether the proportion of the execution time length of each step in the corresponding conventional step execution time length range meets the preset verification proportion requirement, if so, determining that the conventional step execution time length range is effective, and if not, re-analyzing the execution time length set, and adjusting the calculation mode of the common range until the conventional step execution time length range is effective.
- 6. The big data mining method applied to cloud storage service according to claim 1, wherein the performing link characteristic association mining on the mining analysis standard, identifying abnormal association modes deviating from the normal association characteristics and the normal flow characteristics, and obtaining an abnormal association mode set in a data transmission process includes: Extracting normal association features and conventional flow features in the mining analysis standard, decomposing the normal association features into association time interval standards, association type matching degree standards and association data volume consistency standards, and decomposing the conventional flow features into step sequence standards and step execution duration standards; Acquiring real-time associated data in the current data transmission process of the cloud storage service to form a real-time associated data set, wherein the real-time associated data comprises an associated record of the response of the current user operation and the storage node and a current data transmission flow record; Extracting current association features from the real-time association data set to form a current association feature set, wherein the current association features comprise a current association time interval, a current association type matching degree and a current association data volume matching degree; Comparing the current association time interval in the current association feature set with the association time interval standard, and identifying an association record corresponding to the current association time interval exceeding the association time interval standard range as a time abnormality association record; Comparing the matching degree of the current association type in the current association feature set with the association type matching degree standard, and identifying an association record corresponding to the matching degree of the current 
association type exceeding the association type matching degree standard range as a type abnormal association record; comparing the current associated data volume fitness in the current associated feature set with the associated data volume fitness standard, and identifying an associated record corresponding to the current associated data volume fitness exceeding the associated data volume fitness standard range as a data volume abnormal associated record; Extracting current data transmission flow characteristics from the real-time associated data set, wherein the current data transmission flow characteristics comprise a current step sequence and a current step execution time length, and a current data transmission flow characteristic set is formed; Comparing the current step sequence in the current data transmission flow characteristic set with the step sequence standard, and identifying a flow record corresponding to the current step sequence inconsistent with the step sequence standard as a step sequence abnormal flow record; Comparing the current step execution time length in the current data transmission flow characteristic set with the step execution time length standard, and identifying a flow record corresponding to the current step execution time length exceeding the step execution time length standard range as a step time length abnormal flow record; performing mode extraction on the time anomaly associated record, the type anomaly associated record, the data quantity anomaly associated record, the step sequence anomaly flow record and the step duration anomaly flow record, and identifying anomaly record groups with similar characteristics; Carrying out feature induction on each abnormal record group, summarizing the abnormal expression form and appearance rule of each abnormal record group, and forming an abnormal association mode; and integrating all the abnormal association modes to obtain an abnormal association mode set in the data transmission 
process, wherein the abnormal association mode set records the characteristic description and the corresponding abnormal record example of each abnormal association mode in a list form.
- 7. The big data mining method applied to cloud storage service according to claim 6, wherein the pattern extraction is performed on the time anomaly associated record, the type anomaly associated record, the data volume anomaly associated record, the step sequence anomaly flow record, and the step duration anomaly flow record, and identifying anomaly record groups with similar characteristics includes: Defining an abnormal record feature dimension, wherein the feature dimension comprises an abnormal time difference size, an abnormal time occurrence period and related user operation types according to a time abnormal association record; Aiming at the data volume abnormality association record, the characteristic dimension comprises the size of the data volume difference and the direction of the data volume difference; Aiming at the step sequence abnormal flow record, the characteristic dimension comprises missing steps, newly added steps and specific manifestations of disordered step sequence, and aiming at the step duration abnormal flow record, the characteristic dimension comprises a step identifier with a duration exceeding the standard and a step identifier exceeding the standard duration; marking the characteristics of all the abnormal records according to the characteristic dimensions, and adding corresponding characteristic labels for each abnormal record to form an abnormal record set with labels; Processing the marked abnormal record set by adopting a cluster analysis method, setting a cluster quantity range, and calculating the similarity between different abnormal records based on the characteristic labels of the abnormal records; dividing the abnormal records into different initial cluster groups according to the similarity, and calculating the characteristic similarity average value of the abnormal records in each initial cluster group and the characteristic similarity average value among different initial cluster groups; The clustering quantity is 
adjusted, clustering division is repeatedly carried out until the characteristic similarity average value of abnormal records in each clustering group reaches a preset similarity threshold value, and the characteristic similarity average value among different clustering groups is lower than the preset similarity threshold value, so that an effective abnormal record group is obtained; sorting all effective abnormal record groups to form abnormal record groups with similar characteristics, wherein each abnormal record group corresponds to one type of abnormal expression with common characteristics; And recording characteristic description information of each abnormal record group, wherein the characteristic description information comprises characteristic dimensions and characteristic value ranges shared by abnormal records of the abnormal record group, counting the number of the abnormal records in each abnormal record group, and marking the abnormal record groups with the number exceeding a preset number threshold as key abnormal record groups.
- 8. The big data mining method applied to the cloud storage service according to claim 1, wherein the locating the problem link of the data transmission in the cloud storage service according to the abnormal association pattern set includes: analyzing each abnormal association mode in the abnormal association mode set one by one, and extracting abnormal characteristics corresponding to each abnormal association mode, wherein the abnormal characteristics comprise abnormal association time characteristics, abnormal association type characteristics, abnormal data quantity characteristics and abnormal flow step characteristics; Aiming at the abnormal association time characteristics of each abnormal association mode, analyzing the reason that the abnormal association time is prolonged or shortened, judging whether the abnormal association time is related to the transmission efficiency of the data transmission path, and if the abnormal association time characteristics are related to the transmission efficiency of the data transmission path, marking potential problem links corresponding to the abnormal association modes as data transmission path related links; Aiming at the abnormal association type characteristics of each abnormal association mode, analyzing the reasons of unmatched abnormal association types, judging whether the abnormal association types are related to response type configuration of the storage node, and if the abnormal association types are related to the response type configuration of the storage node, marking potential problem links corresponding to the abnormal association modes as storage node response type configuration links; aiming at the abnormal data quantity characteristics of each abnormal association mode, analyzing the reasons of the abnormal association data quantity mismatch, judging whether the abnormal association data quantity mismatch is related to data loss or redundancy in the data transmission process, and if the abnormal 
association data quantity mismatch is related to the data loss or redundancy in the data transmission process, marking a potential problem link corresponding to the abnormal association mode as a data transmission integrity link; Analyzing the reasons of the abnormal flow steps according to the characteristics of the abnormal flow steps of each abnormal association mode, judging whether the abnormal flow steps are related to the response speed of the storage node, and if the abnormal flow steps are related to the response speed of the storage node, marking potential problem links corresponding to the abnormal association modes as storage node response delay links; Classifying and counting all marked potential problem links, counting the number of abnormal association modes corresponding to each potential problem link, and determining the potential problem links with the number of the corresponding abnormal association modes being greater than the set number as main problem links according to the counting result; further analyzing the concrete performance of the main problem link, and determining the concrete position and related components of the main problem link in the cloud storage system by combining the architecture information of the cloud storage service; And sorting the determined specific positions and the problem links related to the components to obtain problem links of data transmission in the cloud storage service, wherein the problem links comprise abnormal links of a data transmission path, response delay links of storage nodes and other problem links determined through analysis.
- 9. The big data mining method applied to cloud storage service according to claim 8, wherein the further analyzing the concrete manifestation of the main problem link includes: For a main problem link marked as a data transmission path related link, extracting data transmission path records in all abnormal association modes corresponding to the main problem link to form a data transmission path record set, wherein the data transmission path records comprise node sequences through which data is transmitted and transmission time among nodes; Segmenting and disassembling each data transmission path record in the data transmission path record set, and splitting each data transmission path into a plurality of continuous path segments, wherein each path segment corresponds to a transmission path between two adjacent nodes; Counting the occurrence times of each path segment in all data transmission path records and corresponding transmission time, and calculating the average transmission time of each path segment; Comparing the average transmission time of different path segments, identifying the path segments with average transmission time exceeding the preset difference range of the cloud storage system of the average transmission time of other path segments, and marking the path segments as suspected transmission efficiency problem path segments; collecting node equipment information corresponding to the suspected transmission efficiency problem path segment to form a node equipment information set, wherein the node equipment information comprises hardware configuration, network connection state and current load condition of a node; Analyzing whether hardware configuration in the node equipment information set meets data transmission requirements, whether a network connection state is stable, and whether the current load condition exceeds a load normal range preset by a cloud storage system; If the hardware configuration does not meet the data transmission 
requirement, the network connection state is unstable or the current load condition exceeds the load normal range preset by the cloud storage system, determining the path section with the suspected transmission efficiency problem as the path section with the transmission efficiency problem; For a main problem link marked as a storage node response delay link, extracting storage node identifiers and response time records in all abnormal association modes corresponding to the main problem link to form a storage node response record set; Performing de-duplication processing on the storage node identifiers in the storage node response record set to obtain a related storage node list, counting the response time record of each storage node in the storage node list, and calculating the average response time of each storage node; comparing the average response time of different storage nodes, identifying the storage nodes with average response time exceeding the preset difference range of the cloud storage system of the average response time of other storage nodes, and marking the storage nodes as suspected response delay storage nodes; Acquiring operation state information of the suspected response delay storage node to form an operation state information set, wherein the operation state information comprises central processing unit utilization rate, memory occupancy rate, disk read-write speed and network bandwidth occupancy condition; Analyzing whether the utilization rate of a central processor in the running state information set exceeds a normal running range of the utilization rate of the central processor preset by the cloud storage system, whether the occupancy rate of a memory exceeds the normal running range of the occupancy rate of the memory preset by the cloud storage system, whether the read-write speed of a disk is lower than the normal running range of the read-write speed of the disk preset by the cloud storage system, and whether the occupancy condition 
of network bandwidth exceeds the normal running range of the occupancy rate of the network bandwidth preset by the cloud storage system; If the utilization rate of the central processing unit exceeds the normal operation range of the utilization rate of the central processing unit preset by the cloud storage system, the memory occupancy rate exceeds the normal operation range of the memory occupancy rate preset by the cloud storage system, the disk read-write speed is lower than the normal operation range of the disk read-write speed preset by the cloud storage system or the network bandwidth occupancy condition exceeds the normal operation range of the network bandwidth preset by the cloud storage system, determining the suspected response delay storage node as the storage node with the response delay problem; recording the determined path segment with the transmission efficiency problem and the storage node with the response delay problem as the concrete expression of the main problem link.
- 10. A big data mining system applied to cloud storage service, comprising: a processor; and a machine-readable storage medium storing machine-executable instructions for the processor; wherein the processor is configured to perform, by executing the machine-executable instructions, the big data mining method applied to cloud storage service according to any one of claims 1 to 9.
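The association strength evaluation recited in claim 3 can be illustrated with a minimal Python sketch. It is not the patented implementation: the weight values, the ideal time difference, the ideal data quantity ratio, the `closeness` scoring function and the `EXPECTED_RESPONSE` type table are all hypothetical choices, since the claim only requires that such values be preset.

```python
from dataclasses import dataclass


@dataclass
class AssociationEntry:
    op_time: float    # operation initiation time (s)
    resp_time: float  # response initiation time (s)
    op_type: str
    resp_type: str
    op_bytes: int     # data quantity involved in the operation
    resp_bytes: int   # response data quantity


# Assumed values: the claim only says these are "preset".
WEIGHTS = {"time": 0.4, "type": 0.3, "volume": 0.3}
IDEAL_TIME_DIFF = 0.05  # seconds
IDEAL_RATIO = 1.0       # response data quantity / operation data quantity
# Assumed operation-type -> expected response-type table.
EXPECTED_RESPONSE = {"upload": "write_ack", "download": "read_data"}


def closeness(value: float, ideal: float) -> float:
    """Return 1.0 at the ideal value and less as the relative deviation grows."""
    return 1.0 / (1.0 + abs(value - ideal) / ideal)


def association_strength(entry: AssociationEntry) -> dict:
    """Return the per-dimension contribution values and the normalized strength."""
    time_score = closeness(entry.resp_time - entry.op_time, IDEAL_TIME_DIFF)
    type_score = 1.0 if EXPECTED_RESPONSE.get(entry.op_type) == entry.resp_type else 0.0
    ratio = entry.resp_bytes / entry.op_bytes if entry.op_bytes else 0.0
    volume_score = closeness(ratio, IDEAL_RATIO)
    result = {
        "time": time_score * WEIGHTS["time"],
        "type": type_score * WEIGHTS["type"],
        "volume": volume_score * WEIGHTS["volume"],
    }
    # Dividing by the weight sum normalizes the strength into [0, 1].
    result["strength"] = sum(result.values()) / sum(WEIGHTS.values())
    return result


strong = association_strength(AssociationEntry(0.0, 0.05, "upload", "write_ack", 100, 100))
weak = association_strength(AssociationEntry(0.0, 0.15, "upload", "read_data", 100, 50))
```

Here `strong` matches every ideal value and scores the maximum strength, while `weak` loses score on all three dimensions. Returning the per-dimension contributions alongside the total mirrors the association strength evaluation result table described in the claim.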
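The common range computation of claim 5 (mean plus or minus a preset multiple of the standard deviation, with the multiple adjusted until a preset proportion of the data is covered) can be sketched as follows. This is an illustrative sketch under assumptions: the function name `common_range`, the default multiple, the coverage proportion, the adjustment step, the upper bound on the multiple, and the cleaning rule (dropping missing and non-positive durations) are hypothetical and are not specified by the claim. The final verification against a sample of flow records is omitted.

```python
import statistics


def common_range(durations, multiple=1.0, coverage=0.9, step=0.5, max_multiple=6.0):
    """Return (lower, upper, multiple) for one step's execution durations."""
    # Assumed cleaning rule: discard missing and non-positive durations.
    cleaned = [d for d in durations if d is not None and d > 0]
    if not cleaned:
        raise ValueError("no valid execution durations")
    mean = statistics.fmean(cleaned)
    std = statistics.pstdev(cleaned)
    while True:
        lower, upper = mean - multiple * std, mean + multiple * std
        covered = sum(lower <= d <= upper for d in cleaned) / len(cleaned)
        # Stop once the range covers the preset proportion of the data,
        # or when the multiple reaches the assumed upper bound.
        if covered >= coverage or multiple >= max_multiple:
            return lower, upper, multiple
        multiple += step


# Hypothetical durations (ms) of one step across flow records; -1 is a bad record.
durations = [10, 11, 9, 10, 12, 8, 10, 11, 9, 30, -1]
lower, upper, used_multiple = common_range(durations)
wide_lower, wide_upper, wide_multiple = common_range(durations, coverage=0.95)
```

With the default 90% coverage the initial multiple already suffices and the outlier of 30 stays outside the range. Raising the coverage to 95% forces the multiple to grow until the outlier is included, which shows why the preset proportion should be chosen with outliers in mind.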
Description
Big data mining method and system applied to cloud storage service Technical Field The invention relates to the technical field of cloud storage services, in particular to a big data mining method and a big data mining system applied to cloud storage services. Background In the current cloud storage business field, with the explosive growth of data volume and the increasing diversification of data storage and access demands of users, the scale and complexity of a cloud storage system are also continuously improved. Cloud storage platforms need to handle various operation requests from a large number of users, which encompass uploading, downloading, modifying, deleting, etc. of data, and which are typically continuous and complex. Most of the existing cloud storage service management modes focus on capacity expansion and basic data access control of data storage, and an effective means is lacking for management and optimization of a data transmission process. During data transmission, because the cloud storage system includes numerous storage nodes, the relationship between user operations and storage node responses is complex and elusive. The traditional method is often simply used for recording feedback information of the user operation and the storage node, but does not deeply analyze internal association between the user operation and the storage node, and cannot establish an accurate corresponding relation between the user operation and the storage node response. Meanwhile, for the problems of abnormal conditions, such as abnormal data transmission paths, delayed response of storage nodes and the like, in the data transmission process, the problem links are difficult to quickly and accurately locate in the prior art. 
Because the data transmission process is not comprehensively analyzed and mined, effective optimization instructions cannot be generated to adjust the data transmission strategy and the storage node configuration. This degrades the performance and stability of the cloud storage service and makes it difficult to meet users' requirements for efficient and reliable cloud storage.
Disclosure of Invention
In view of the above problems, in a first aspect, an embodiment of the present invention provides a big data mining method applied to cloud storage service, the method comprising:
Building an association mapping relation between a user operation sequence and a storage node response sequence, wherein the user operation sequence comprises continuous operation records executed by a user on a cloud storage platform, the storage node response sequence comprises feedback records of each storage node in a cloud storage system on the user operations, and a correspondence between the user operations and the storage node responses is established through the association mapping relation;
Generating a mining analysis benchmark of the data transmission process based on the association mapping relation, wherein the mining analysis benchmark comprises normal association characteristics of user operations and storage node responses and conventional flow characteristics of data transmission;
Performing link characteristic association mining on the mining analysis benchmark, and identifying abnormal association modes that deviate from the normal association characteristics and the conventional flow characteristics, to obtain an abnormal association mode set of the data transmission process;
Locating problem links of data transmission in the cloud storage service according to the abnormal association mode set, wherein the problem links comprise abnormal links of a data transmission path and response delay links of storage nodes;
And generating a cloud storage service optimization instruction based on the problem links, and sending the cloud storage service optimization instruction to a cloud storage management system to adjust the data transmission strategy and the storage node configuration.
In still another aspect, an embodiment of the present invention further provides a big data mining system applied to cloud storage service, comprising: a processor; and a machine-readable storage medium storing machine-executable instructions for the processor, wherein the processor is configured to execute the big data mining method applied to cloud storage service by executing the machine-executable instructions.
Based on the above aspects, the embodiment of the invention can accurately establish the correspondence between user operations and storage node responses by constructing the association mapping relation between the user operation sequence and the storage node response sequence, fully covers the normal association characteristics of the user operation and the storage node response and the conventional flow characteristics of the data transmission based on the mining analysis standard of the data tran