CN-121979889-A - Behavior log time sequence data distributed storage and multidimensional analysis method for target users
Abstract
The invention relates to the technical field of data processing, and in particular to a method for distributed storage and multidimensional analysis of behavior log time-series data for a target user. The method comprises four steps: distributed ingestion of behavior logs, distributed persistent data storage, dynamic optimization of a dimension-entropy-aware index, and multidimensional low-latency query execution. The method captures the dimension features of multidimensional queries, cleans and screens them, constructs a dynamic decay function based on the aging characteristics of behavior logs, applies a time-series correction to the original entropy value of each dimension, and adaptively switches between index strategies for high-entropy and low-entropy dimensions according to the distribution of the corrected entropy values. An index system containing independent local indexes and a global initial index is built on the persistent log data carrying shard identifiers, and the query features and performance data returned by query execution close a dynamic optimization loop. As a result, the index structure tracks both log aging and the distribution of query features, which improves retrieval adaptability, efficiency, and stability.
Inventors
- SUN LONG
- HU HONGJUN
- ZHU YAN
Assignees
- 上海聚告信息技术服务有限公司
Dates
- Publication Date: 20260505
- Application Date: 20260128
Claims (10)
- 1. A method for distributed storage and multidimensional analysis of behavior log time-series data for a target user, characterized by comprising the following steps: S1, distributed ingestion of behavior logs: receiving a target user behavior log time-series data stream in real time, performing peak clipping through a distributed buffer queue, then performing data partitioning and integrity verification, and outputting standardized preprocessed log data to S2; S2, distributed persistent data storage: receiving the standardized preprocessed log data output by S1, persisting it to a distributed cluster, completing the storage layout using a time-user two-dimensional sharding strategy, and outputting persistent log data carrying shard identifiers to S3; S3, dynamic optimization of the dimension-entropy-aware index: receiving the persistent log data carrying shard identifiers output by S2, constructing an initial index, capturing the multidimensional query dimension features returned by S4, constructing a dynamic decay function based on the aging characteristics of the behavior logs, applying a time-series correction to the dimension entropy values, adaptively switching between index strategies for high-entropy and low-entropy dimensions according to the distribution of the corrected entropy values, and outputting an optimized index structure to S4; and S4, multidimensional low-latency query execution: receiving the optimized index structure output by S3, executing a target user behavior analysis task, accelerating data retrieval through index push-down and local aggregation mechanisms, and returning the dimension features and performance data of the query to S3.
- 2. The method for distributed storage and multidimensional analysis of behavior log time-series data for a target user according to claim 1, wherein the distributed ingestion of behavior logs in S1 specifically comprises: performing a hash partition operation using the user identifier contained in the target user behavior log time-series data stream as the partition key; performing peak clipping on the target user behavior log time-series data stream through a distributed buffer queue; sequentially performing log field non-empty verification, field format compliance verification, and log time-sequence continuity verification on the partitioned target user behavior log time-series data stream; and standardizing the target user behavior log time-series data stream that has passed all verifications to obtain standardized preprocessed log data, and outputting it to S2.
- 3. The method for distributed storage and multidimensional analysis of behavior log time-series data for a target user according to claim 2, wherein the distributed persistent data storage in S2 comprises the following steps: S21, receiving the standardized preprocessed log data output by S1 and adapting its data format so that it meets the storage access requirements of the distributed cluster; S22, performing a time-dimension sharding operation on the format-adapted standardized preprocessed log data, dividing it into a plurality of time shards according to a preset time period; S23, performing a user-dimension sharding operation on each time shard, dividing each time shard into a plurality of user sub-shards on the basis of the user identifier, thereby completing the storage layout of the time-user two-dimensional sharding strategy; S24, persisting all user sub-shards that have completed two-dimensional sharding to the corresponding storage nodes of the distributed cluster; and S25, assigning a unique shard identifier to each user sub-shard, binding the shard identifier to the corresponding user sub-shard to obtain persistent log data carrying shard identifiers, and outputting it to S3.
- 4. The method for distributed storage and multidimensional analysis of behavior log time-series data for a target user according to claim 3, wherein the dynamic optimization of the dimension-entropy-aware index in S3 comprises the following steps: S31, receiving the persistent log data carrying shard identifiers output by S2 and constructing an initial index system based on it; S32, capturing in real time the multidimensional query dimension features returned by S4, preprocessing them, and screening out effective feature data; S33, constructing a dynamic decay function based on the aging characteristics of the behavior logs, calculating the original entropy value of each dimension from the effective feature data obtained in S32, and applying a time-series correction to the original dimension entropy value through the dynamic decay function to obtain a corrected dimension entropy value; S34, adaptively switching between index strategies for high-entropy and low-entropy dimensions according to the distribution of the corrected dimension entropy values; and S35, performing a consistency check on the index structure optimized by the strategy switch, and outputting it to S4 after it passes the check.
- 5. The method for distributed storage and multidimensional analysis of behavior log time-series data for a target user according to claim 4, wherein constructing the initial index system based on the persistent log data carrying shard identifiers in S31 comprises the following steps: S31.1, grouping the persistent log data by shard identifier to obtain the user sub-shard data set corresponding to each shard identifier; S31.2, for each user sub-shard data set, traversing every persistent log record in the set, extracting all dimension fields contained in each record together with the physical storage address of that record in the distributed cluster, and establishing a one-to-one mapping between the dimension fields and the corresponding physical storage addresses to form index mapping entries; S31.3, constructing an independent local index for each user sub-shard data set based on the mapping, wherein the local index associates dimension fields, data storage addresses, and shard identifiers; and S31.4, merging the local indexes of all user sub-shard data sets, eliminating redundant index entries and index conflicts through an index deduplication algorithm, and generating a global initial index.
- 6. The method for distributed storage and multidimensional analysis of behavior log time-series data for a target user according to claim 5, wherein the preprocessing and effective feature screening of the multidimensional query dimension features in S32 comprises the following steps: S32.1, capturing in real time the multidimensional query dimension features returned by S4, wherein the multidimensional query dimension features comprise the dimension types involved in the queries, the feature values of each dimension, and the corresponding query frequencies; S32.2, cleaning the captured multidimensional query dimension features by removing invalid feature values, duplicate feature records, and abnormal query frequency data; S32.3, according to a preset feature effectiveness threshold, screening out the dimension types and corresponding feature values whose query frequency is higher than the threshold to obtain effective feature data; and S32.4, storing the effective feature data in timestamp order to form a feature time-series data set.
- 7. The method for distributed storage and multidimensional analysis of behavior log time-series data for a target user according to claim 6, wherein constructing the dynamic decay function and applying the time-series correction to the dimension entropy values in S33 comprises the following steps: S33.1, constructing a dynamic decay function based on the aging characteristics of the behavior logs, the dynamic decay function being calibrated using historical data on the aging distribution of log queries; S33.2, calculating the original entropy value of each dimension based on the feature time-series data set obtained in S32; and S33.3, combining the original entropy value of each dimension with the calculation result of the dynamic decay function for the corresponding behavior logs to complete the time-series correction and obtain the corrected dimension entropy value.
- 8. The method for distributed storage and multidimensional analysis of behavior log time-series data for a target user according to claim 7, wherein calculating the original dimension entropy value in S33.2 comprises the following steps: S33.21, for the dimension type corresponding to each item of effective feature data, counting the query frequency of each feature value in that dimension and determining the proportion of each feature value among the queries on that dimension; S33.22, calculating the original entropy value of each dimension based on the core concept of information-theoretic entropy; and S33.23, performing a plausibility check on the calculated original entropy value of each dimension and removing abnormal entropy data.
- 9. The method for distributed storage and multidimensional analysis of behavior log time-series data for a target user according to claim 8, wherein adaptively switching the index strategy based on the corrected entropy distribution in S34 comprises the following steps: S34.1, presetting a dimension entropy threshold, the value of which is calibrated in combination with the upper limit of the storage resources of the distributed cluster and the query performance requirements; S34.2, comparing the corrected entropy value of each dimension with the dimension entropy threshold, wherein a dimension whose corrected entropy value exceeds the threshold is determined to be a high-entropy dimension and a dimension whose corrected entropy value falls below the threshold is determined to be a low-entropy dimension; S34.3, adopting a fine-grained index strategy for high-entropy dimensions and a coarse-grained index strategy for low-entropy dimensions; and S34.4, based on the switched index strategy, reconstructing and optimizing the global initial index constructed in S31 and updating the mapping between the index and the physical data storage addresses to form an optimized index structure.
- 10. The method for distributed storage and multidimensional analysis of behavior log time-series data for a target user according to claim 9, wherein the multidimensional low-latency query execution in S4 comprises the following steps: S4.1, receiving the optimized index structure output by S3 and verifying its integrity and validity; S4.2, loading the verified optimized index structure into a query execution engine and establishing the association between the optimized index structure and the persistent log data carrying shard identifiers in the distributed cluster; S4.3, based on the loaded optimized index structure, executing a target user behavior analysis task, pushing the query filter conditions down to each shard storage node of the distributed cluster through an index push-down mechanism, each shard storage node filtering its local persistent log data carrying shard identifiers according to the query filter conditions and performing local aggregation on the filtered data; S4.4, merging the local aggregation results of the shard storage nodes to obtain the query result of the target user behavior analysis task; and S4.5, collecting the dimension features involved in the query and the corresponding query performance data, and returning the dimension features and performance data of the query to S3.
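The time-user two-dimensional sharding of claims 2 and 3 can be illustrated with a minimal Python sketch. It is not the patented implementation: the record fields (`user_id`, `ts`), the one-hour period, the number of user buckets, the use of a SHA-256 hash to derive the user sub-shard, and the `t<period>-u<bucket>` identifier format are all assumptions chosen for illustration, since the claims only state that a preset time period and the user identifier are used as sharding bases.

```python
import hashlib
from collections import defaultdict


def time_bucket(ts, period_seconds=3600):
    """Start of the preset time period that contains Unix timestamp ts."""
    return ts - (ts % period_seconds)


def user_bucket(user_id, buckets=4):
    """Map a user identifier to a sub-shard number.

    Assumption: a stable SHA-256 hash is used, because Python's built-in
    hash() for strings is salted per process and would not be repeatable.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def shard_id(record, period_seconds=3600, buckets=4):
    """Hypothetical shard identifier: time shard first, then user sub-shard."""
    t = time_bucket(record["ts"], period_seconds)
    u = user_bucket(record["user_id"], buckets)
    return f"t{t}-u{u}"


def layout(records, period_seconds=3600, buckets=4):
    """Group records by shard identifier (S22, S23 and S25 in one pass)."""
    shards = defaultdict(list)
    for record in records:
        shards[shard_id(record, period_seconds, buckets)].append(record)
    return dict(shards)


# Three log records of the same user; the third falls into the next hour.
logs = [
    {"user_id": "u1", "ts": 1000, "action": "click"},
    {"user_id": "u1", "ts": 2000, "action": "view"},
    {"user_id": "u1", "ts": 4000, "action": "purchase"},
]
shards = layout(logs)
```

With this layout, the records of one user within one time period share a shard identifier, while the same user's records in another period land in a different time shard with the same user bucket suffix.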
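Steps S33 and S34 of claims 7 to 9 (original entropy, time-series correction, strategy switch) can be sketched in Python as follows. The claim text available here does not reproduce the decay function or the correction formula, so this sketch makes loudly labeled assumptions: the decay is taken to be exponential with a hypothetical half-life, the correction is taken to be a simple multiplication, and a dimension whose corrected entropy equals the threshold is treated as high-entropy. Only the use of Shannon entropy over the query-frequency distribution follows directly from the "information-theoretic entropy" wording of S33.22.

```python
import math
from collections import Counter


def raw_entropy(value_counts):
    """Shannon entropy in bits of one dimension's query-frequency distribution."""
    total = sum(value_counts.values())
    if total <= 0:
        return 0.0
    entropy = 0.0
    for count in value_counts.values():
        if count > 0:
            p = count / total  # share of this feature value among the queries
            entropy -= p * math.log2(p)
    return entropy


def decay_weight(log_age_days, half_life_days=7.0):
    """ASSUMED exponential decay; the source does not give the function's form."""
    return 0.5 ** (log_age_days / half_life_days)


def corrected_entropy(value_counts, log_age_days, half_life_days=7.0):
    """ASSUMED correction: original entropy multiplied by the decay weight."""
    return raw_entropy(value_counts) * decay_weight(log_age_days, half_life_days)


def choose_strategy(h_corrected, threshold):
    """Fine-grained index for high entropy, coarse-grained for low entropy.

    Treating equality as high entropy is an assumption of this sketch.
    """
    return "fine" if h_corrected >= threshold else "coarse"


# Hypothetical query features: counts of queried feature values per dimension.
query_features = {
    "user_id": Counter({"u1": 25, "u2": 25, "u3": 25, "u4": 25}),
    "event_type": Counter({"click": 97, "purchase": 3}),
}
strategies = {
    dim: choose_strategy(corrected_entropy(counts, log_age_days=7.0), threshold=0.5)
    for dim, counts in query_features.items()
}
```

In this example the evenly queried `user_id` dimension keeps a corrected entropy of 1.0 bit after one half-life and is assigned the fine-grained strategy, while the skewed `event_type` dimension falls below the hypothetical threshold of 0.5 and is assigned the coarse-grained strategy.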
Description
Behavior log time sequence data distributed storage and multidimensional analysis method for target users
Technical Field
The invention relates to the technical field of data processing, and in particular to a method for distributed storage and multidimensional analysis of behavior log time-series data for a target user.
Background
With the spread of Internet applications, target user behavior log time-series data is growing explosively. This data carries information such as user identifiers, operation behaviors, and timestamps, and is the core basis for user behavior analysis and service optimization. Because the data has strong time-sequence continuity, many query dimensions, and demanding real-time response requirements, distributed storage has become a necessity. However, how to optimize the storage partition layout, adapt the index structure to diverse queries, and improve the low-latency performance of multidimensional queries remains a key technical challenge.
Related patents have explored this area. Invention patent CN202510336285.4 discloses a distributed time-series database management system and method, which selects a distributed storage technology based on the characteristics of the time-series data, analyzes user behavior data to adjust permissions, desensitizes sensitive information and optimizes the analysis results, and determines an optimal index structure according to real-time service requirements, so as to improve the efficiency, security, and query performance of database management.
As another example, invention patent CN202311040804.X discloses a system, method, storage medium, and computing device for analyzing user behavior, in which logs are received by a log server and stored in a time-series database classified by embedded-point (event-tracking) object, and a second analysis model parses the logs according to configured rules, so as to associate the logs of embedded-point objects and verify the effectiveness of embedded-point collection.
The above technical schemes have the following defects. First, index adjustment does not take log aging characteristics or the distribution of dimension query features into account, so adaptability is insufficient. Invention patent CN202510336285.4 determines its index structure based on service requirements; it does not consider that the query frequency of logs decreases over time, nor does it address the differences in feature distribution between query dimensions, so its index adjustment is not accurately adapted to the attributes of the data itself. Invention patent CN202311040804.X relies on preset parsing rules to build its retrieval logic; its index adaptation logic is fixed and cannot be adjusted dynamically according to log aging and the distribution of query dimension features, which limits retrieval efficiency. Second, the storage layout does not co-design the core dimensions of time and user, so the cost of integrating retrieved data is high. Invention patent CN202510336285.4 designs its storage scheme only around time-series characteristics, so retrieving a specific user's data across time periods requires reading from multiple nodes, which increases transmission and integration overhead. Invention patent CN202311040804.X stores data by embedded-point object without sharding by time and user, so retrieving a specific user's data across time periods requires integrating scattered data, which makes the retrieval process time-consuming and cumbersome.
In view of this, we propose a method for distributed storage and multidimensional analysis of behavior log time-series data for target users.
Disclosure of Invention
The invention aims to provide a method for distributed storage and multidimensional analysis of behavior log time-series data for a target user, so as to solve the problems identified in the background: index adjustment does not take log aging characteristics or the distribution of dimension query features into account and is therefore insufficiently adaptive, and the storage layout does not co-design the core dimensions of time and user, so the cost of integrating retrieved data is high.
To solve the above technical problems, the invention provides a method for distributed storage and multidimensional analysis of behavior log time-series data for a target user, comprising the following steps: S1, distributed ingestion of behavior logs: receiving a target user behavior log time-series data stream in real time, performing peak clipping through a distributed buffer queue, then performing data partitioning and integrity verification, and outputting standardized preprocessed log data to S2; S2, distributed persistent data storage: the standardized preprocessed log data output by S1 are received and persisted to a distributed cluster, a time-user two-dim