
CN-121807937-B - Modeling method of a statistical mixture model in a big data distributed scenario

CN121807937B

Abstract

The invention relates to the technical field of computers and discloses a modeling method of a statistical mixture model in a big data distributed scenario. The method comprises: storing data fragments in a distributed manner; initializing model parameters; and iteratively executing expectation steps and maximization steps, namely scheduling the expectation steps onto GPU nodes to compute posterior probabilities in parallel, scheduling the maximization steps onto CPU nodes to aggregate statistics, update parameters, perform component merging/deletion and judge convergence, and reducing redundant data transmission with a memory reuse mechanism based on reference counting and scope analysis. Through the co-optimization of heterogeneous task scheduling and memory, the invention improves training speed, resource utilization and the model's self-adaptation capability.
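The heterogeneous scheduling idea in the abstract — dense E-step work routed to GPU nodes, branchy M-step aggregation routed to CPU nodes — can be caricatured with a minimal task-routing sketch. This is an illustration only: the pool names, the stand-in task functions, and the use of thread pools in place of device nodes are all hypothetical, not taken from the patent.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical device pools: in a real cluster these would be GPU workers
# and CPU workers; here plain thread pools stand in for both.
POOLS = {
    "gpu": ThreadPoolExecutor(max_workers=4, thread_name_prefix="gpu"),
    "cpu": ThreadPoolExecutor(max_workers=2, thread_name_prefix="cpu"),
}

def dispatch(task, device, *args):
    """Route a task to the pool matching its computational profile."""
    return POOLS[device].submit(task, *args)

def e_step_task(shard_id):
    # Dense, data-parallel work -> "gpu" pool (posterior computation).
    return ("posterior", shard_id)

def m_step_task(summaries):
    # Branchy aggregation / merge-delete logic -> "cpu" pool.
    return ("params", len(summaries))

# One EM round over 8 shards: fan out E-step tasks, then aggregate on CPU.
futures = [dispatch(e_step_task, "gpu", i) for i in range(8)]
summaries = [f.result() for f in futures]
new_params = dispatch(m_step_task, "cpu", summaries).result()
```

The fan-out/fan-in shape mirrors the claimed pipeline: E-step tasks are independent per shard, while the M-step is a single aggregation point.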

Inventors

  • LIU QIHONG

Assignees

  • Sanya University (三亚学院)

Dates

Publication Date
2026-05-08
Application Date
2026-03-06

Claims (10)

  1. A method for modeling a statistical mixture model in a big data distributed scenario, comprising: dividing an input original observation data set into a plurality of data fragments according to a preset fragmentation rule, and storing the data fragments in a plurality of working nodes of a distributed computing cluster in a distributed manner; initializing a parameter set of the mixture model, wherein the parameter set comprises a weight coefficient, a mean vector and a covariance matrix for each component; executing an iterative optimization process until a preset convergence condition is met, wherein the iterative optimization process comprises alternately executing an expectation step (E-step) and a maximization step (M-step); when executing the E-step, dispatching E-step tasks to computing nodes equipped with graphics processing units (GPUs), and using the GPUs to compute, in parallel, the posterior probability of each data fragment with respect to each mixture component; when executing the M-step, dispatching M-step tasks to computing nodes equipped with central processing units (CPUs), aggregating posterior probability statistics from all data fragments on the CPUs, and performing component parameter updating, sparse data filtering, component merging or splitting judgment, and convergence evaluation; and, between the E-step and the M-step, managing the life cycle of intermediate computation results with a memory reuse mechanism based on reference counting and scope analysis, so that redundant data copies and frequent data migration between host and device are avoided.
  2. The modeling method of a statistical mixture model in a big data distributed scenario according to claim 1, wherein dividing an input original observation data set into a plurality of data fragments according to a preset fragmentation rule comprises: uniformly distributing the data across a preset number of data fragments according to a primary-key hash value or a timestamp range of the observed data; wherein the size of each data fragment does not exceed the upper limit of the memory capacity of a single working node, and the number of data fragments matches the number of GPUs in the cluster.
  3. The method of modeling a statistical mixture model in a big data distributed scenario of claim 2, wherein initializing a parameter set of the mixture model comprises: randomly sampling a plurality of subsets from the original observation data set, and computing the global mean and covariance of the subsets as the initial mean vector and covariance matrix; the initial weight coefficients of the respective mixture components are set to an equal value, namely the reciprocal of the number of mixture components.
  4. A method of modeling a statistical mixture model in a big data distributed scenario according to claim 3, wherein, in executing the E-step, scheduling E-step tasks to computing nodes equipped with GPUs comprises: constructing an E-step computation task graph that defines the computational dependency between each data fragment and each mixture component; distributing the computation subtasks in the task graph to idle GPUs in the cluster; loading the corresponding data fragments and the current model parameters onto each GPU, and executing matrix multiplication, exponential-function and normalization operations to generate a posterior probability matrix of each observation sample over each mixture component; the posterior probability matrix is stored in GPU video memory in column-major format.
  5. The modeling method of a statistical mixture model in a big data distributed scenario according to claim 4, wherein the posterior probability is expressed as: $\gamma_{ik} = \dfrac{\pi_k\, p_k(x_i)}{\sum_{j=1}^{K} \pi_j\, p_j(x_i)}$; where $\gamma_{ik}$ is the posterior probability of sample $x_i$ with respect to the $k$-th component, $\sum_{j=1}^{K} \pi_j\, p_j(x_i)$ is the marginal likelihood of the sample, $p_k(x_i)$ is the likelihood value under the $k$-th mixture component, and $\pi_k$ is the corresponding weight coefficient.
  6. The method for modeling a statistical mixture model in a big data distributed scenario of claim 5, wherein, in executing the M-step, scheduling M-step tasks to computing nodes equipped with CPUs comprises: starting an aggregation process on a CPU, wherein the aggregation process pulls a statistical summary of the posterior probability matrix from the video memory of each GPU via remote direct memory access (RDMA), the statistical summary comprising, for each component, the total responsibility weight, the weighted observation sum and the weighted outer-product sum; computing a new mean vector, covariance matrix and weight coefficient from the statistical summary; performing a positive-definiteness check on the newly computed covariance matrix and, if the check fails, applying diagonal-loading correction; comparing the responsibility weight of each component against a threshold and, if it is smaller than a preset minimum responsibility threshold, marking the component for deletion; computing the mean-vector distance between any two components and, if this distance is smaller than a preset merging distance threshold and the Frobenius-norm difference of their covariance structures is smaller than a preset similarity threshold, executing a component merging operation; and judging whether the convergence condition is met based on the log-likelihood difference between the new and old parameter sets.
  7. The method of modeling a statistical mixture model in a big data distributed scenario of claim 6, wherein executing a component merging operation comprises: adding the weight coefficients of the two components to be merged to obtain the weight of the new component; dividing the sum of the weighted mean vectors of the two components by the total weight to obtain the mean vector of the new component; and combining the weighted covariance matrices, weighted outer-product terms and cross terms of the two components, and then recomputing the covariance matrix of the new component.
  8. The method for modeling a statistical mixture model in a big data distributed scenario of claim 7, wherein the memory reuse mechanism based on reference counting and scope analysis comprises: pre-allocating a fixed-size buffer in graphics processing unit (GPU) video memory for each data fragment before the E-step starts; after the E-step computation completes, retaining the posterior probability matrix until the aggregation process of the M-step has finished reading it; releasing all intermediate data buffers belonging to the previous iteration immediately after the parameter update of the M-step completes; and, for the model parameters, maintaining a single authoritative copy in CPU memory that is copied to the video memory of each GPU by asynchronous data transfer before each E-step starts, the transfer overlapping with computation to hide communication latency.
  9. The method for modeling a statistical mixture model in a big data distributed scenario of claim 8, wherein the preset convergence condition comprises: the absolute value of the log-likelihood increment of the statistical mixture model being smaller than a preset likelihood convergence threshold for three consecutive iterations, or the total number of iterations reaching a preset maximum iteration count.
  10. The method for modeling a statistical mixture model in a big data distributed scenario of claim 9, wherein the likelihood value $p_k(x_i)$ under the $k$-th mixture component is given by a multivariate Gaussian probability density function: $p_k(x_i) = (2\pi)^{-d/2}\,\lvert\Sigma_k\rvert^{-1/2} \exp\!\left(-\tfrac{1}{2}(x_i-\mu_k)^{\top}\Sigma_k^{-1}(x_i-\mu_k)\right)$; where $x_i$ is the feature vector of the $i$-th sample within its data fragment, $\mu_k$ and $\Sigma_k$ are respectively the current mean vector and covariance matrix of the $k$-th component, and $d$ is the feature dimension; the marginal likelihood of the sample is: $p(x_i) = \sum_{k=1}^{K} \pi_k\, p_k(x_i)$; where $K$ is the number of mixture components.
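Taken together, claims 1, 5, 6 and 10 describe one round of distributed EM for a Gaussian mixture: each shard's E-step produces posterior responsibilities, each shard contributes sufficient statistics (total responsibility weight, weighted observation sum, weighted outer-product sum), and the M-step rebuilds the parameters with diagonal loading. The following is a single-process NumPy sketch of that round only; the shard list, initialization and `eps` value are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def e_step(shard, weights, means, covs):
    """E-step for one data shard: responsibilities gamma[i, k] and the
    shard's log-likelihood under the current parameters (claims 5 and 10)."""
    n, d = shard.shape
    K = len(weights)
    log_prob = np.empty((n, K))
    for k in range(K):
        diff = shard - means[k]
        inv = np.linalg.inv(covs[k])
        _, logdet = np.linalg.slogdet(covs[k])
        maha = np.einsum('ij,jk,ik->i', diff, inv, diff)  # Mahalanobis terms
        log_prob[:, k] = (np.log(weights[k])
                          - 0.5 * (d * np.log(2 * np.pi) + logdet + maha))
    m = log_prob.max(axis=1, keepdims=True)               # log-sum-exp for stability
    log_marginal = m + np.log(np.exp(log_prob - m).sum(axis=1, keepdims=True))
    return np.exp(log_prob - log_marginal), log_marginal.sum()

def em_iteration(shards, weights, means, covs, eps=1e-6):
    """One EM round: aggregate per-shard sufficient statistics (claim 6),
    then rebuild parameters, with diagonal loading for positive definiteness."""
    K, d = means.shape
    Nk = np.zeros(K)                    # total responsibility weight per component
    sum_x = np.zeros((K, d))            # weighted observation sums
    sum_xxT = np.zeros((K, d, d))       # weighted outer-product sums
    loglik = 0.0
    for shard in shards:                # in the patent, one E-step task per GPU
        gamma, ll = e_step(shard, weights, means, covs)
        loglik += ll
        Nk += gamma.sum(axis=0)
        sum_x += gamma.T @ shard
        sum_xxT += np.einsum('ik,ij,il->kjl', gamma, shard, shard)
    new_w = Nk / Nk.sum()
    new_mu = sum_x / Nk[:, None]
    new_cov = (sum_xxT / Nk[:, None, None]
               - np.einsum('kj,kl->kjl', new_mu, new_mu))
    new_cov += eps * np.eye(d)          # diagonal loading (claim 6)
    return new_w, new_mu, new_cov, loglik
```

Note that `loglik` is evaluated at the incoming parameters, so its sequence across iterations is nondecreasing, which is the usual basis for the convergence check of claim 9.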

Description

Modeling method of a statistical mixture model in a big data distributed scenario

Technical Field

The invention belongs to the technical field of computers, and particularly relates to a modeling method of a statistical mixture model in a big data distributed scenario.

Background

With the deep penetration of big data technology into fields such as financial risk control, smart healthcare, bioinformatics and the Internet of Things, analysis methods based on statistical mixture models are widely applied owing to their strong ability to model complex data distributions. A mixture model approximates the multi-modal nature of real data by combining multiple probability distribution components, and its training process is typically optimized iteratively with the Expectation-Maximization (EM) algorithm. The EM algorithm has good convergence properties in theory, but in practical large-scale distributed deployment it faces severe computational-efficiency and resource-adaptation problems. In particular, in heterogeneous hardware environments there is a mismatch between the dynamic nature of model training and the static scheduling of hardware characteristics, which restricts the efficient deployment of mixture models in cloud-edge collaborative scenarios.
The EM iteration of a mixture model is naturally computationally heterogeneous. The E-step involves posterior probability computation for a large number of samples over each component; it consists of high-density matrix operations, is highly parallel, and is well suited to accelerators such as GPUs. The M-step comprises logic-intensive operations such as model parameter updating, component merging or splitting judgment, sparse data processing and convergence checking; it relies on complex control flow and conditional branches and is better suited to a general-purpose CPU architecture. However, currently mainstream distributed computing frameworks generally adopt a coarse-grained task-partitioning strategy that schedules the whole EM iteration as a single computing unit and cannot perceive the fine-grained computational differences inside the mixture model; as a result, GPU resources sit idle during the M-step while the CPU becomes a bottleneck during the E-step, causing significant hardware utilization imbalance. The prior art has the following problems: it lacks dynamic awareness of runtime features such as the mixture model structure and data sparsity, and cannot accurately identify the computational load boundary between the E-step and the M-step before task execution; the task-scheduling mechanism is not deeply coupled with the underlying hardware characteristics, making intelligent cross-device subtask offloading and execution-path optimization difficult; and frequent cross-device data transfers are not effectively pipelined, so data-movement overhead between GPU and CPU becomes a performance bottleneck, which is especially pronounced in scenarios where edge-device memory is limited.
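The data-movement bottleneck described above is what the patent's reference-counting buffer reuse (claim 8) targets: an intermediate result stays in place until its last consumer releases it, and the buffer is then recycled instead of being reallocated or copied. The following is a minimal sketch of that idea; the `BufferPool` class and its API are hypothetical illustrations, not taken from the patent.

```python
class BufferPool:
    """Reference-counted reuse of fixed-size buffers, so an intermediate
    result (e.g. a posterior matrix) is freed exactly when its last
    consumer - such as the M-step aggregator - has finished reading it."""

    def __init__(self, size):
        self.size = size
        self.free = []                 # recycled buffers awaiting reuse
        self.refs = {}                 # id(buf) -> outstanding reader count

    def acquire(self, consumers):
        """Hand out a recycled buffer if one exists, else allocate."""
        buf = self.free.pop() if self.free else bytearray(self.size)
        self.refs[id(buf)] = consumers
        return buf

    def release(self, buf):
        """One consumer is done; recycle the buffer when the count hits 0."""
        self.refs[id(buf)] -= 1
        if self.refs[id(buf)] == 0:    # last reader done: recycle, no copy
            del self.refs[id(buf)]
            self.free.append(buf)

pool = BufferPool(1024)
posterior = pool.acquire(consumers=2)  # read by, say, aggregator and logger
pool.release(posterior)
pool.release(posterior)                # back in the free list
reused = pool.acquire(consumers=1)
```

After both releases, `reused` is the very same buffer object as `posterior`: the next iteration pays no allocation or copy cost, which is the point of the mechanism.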
These problems make it difficult for the same mixture-model algorithm to tune itself adaptively on clusters with different hardware configurations, severely limiting full-scenario deployment from cloud high-performance servers down to resource-constrained edge nodes. A new modeling method is therefore needed that can perceive the computational characteristics of the mixture model, support heterogeneous hardware co-scheduling, and optimize the data flow.

Disclosure of Invention

The invention provides a modeling method of a statistical mixture model in a big data distributed scenario, and aims to solve the technical problems of low computational efficiency, insufficient hardware resource utilization and slow model convergence that arise because existing distributed computing frameworks cannot perform fine-grained resource scheduling and collaborative memory optimization for the heterogeneous computational characteristics of the expectation step and the maximization step of the expectation-maximization algorithm when processing statistical mixture models. By constructing a task decomposition and dynamic scheduling mechanism oriented to heterogeneous computing units, combined with a memory-management strategy based on data locality and life cycle, the invention achieves efficient collaborative execution of the dense matrix operations of the expectation step and the sparse logic processing of the maximization step. The invention provides a modeling method of a statistical mixture model in a big data distributed scenario, comprising the following steps: dividing an input original observation data set into a plurality of data fragments according to a preset fragmentation rule, and storing the data fragments in a plurality of working nodes of a distributed computing cluster in a distributed manner