
CN-122019156-A - Dynamic optimization method for CPU-side DLRM inference based on load pre-calculation

CN 122019156 A

Abstract

The invention provides a load pre-calculation based dynamic optimization method for CPU-side DLRM inference, belonging to the technical field of deep learning and computer system optimization. The method comprises two steps. Step 1, dynamic core allocation: S1, construct a heterogeneous load quantization model for the embedding lookup task and the Bottom-MLP task; S2, perform load sampling and dynamic adjustment; S3, modify the deep learning framework's thread pool mechanism, allocating and binding CPU physical cores to the two tasks to realize parallel scheduling. Step 2, dynamic prefetching: T1, establish timeliness matching and accurate prefetching principles for the three-level indirect addressing characteristic of embedding lookup, and explicitly define the prefetch distance as the key control parameter; T2, record iteration periods in the core loop of the embedding_bag operator, separating the aggregation computation time from the DRAM access time; T3, dynamically compute the prefetch distance and insert prefetch instructions. The invention effectively solves the problems of unbalanced core allocation across heterogeneous tasks and prefetch failure under irregular memory access, reduces DLRM inference end-to-end latency, and improves CPU resource utilization and cache hit rate.
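As a concrete illustration of the load quantization model and proportional core split summarized above, the following is a minimal C++ sketch under assumed names and units (cycles). It is not the patent's reference implementation; it only writes out the formulas described in the abstract (and in claims 2, 3, and 5 below).

```cpp
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

struct TableStats {
    double lookups_per_batch;  // L_i: lookups into table i per batch
    double l3_hit_rate;        // h_i: L3 hit rate of table i
};

// W_emb: sum over tables of (lookups * average memory access time).
double embedding_load(const std::vector<TableStats>& tables,
                      double t_l3, double t_dram) {
    double w = 0.0;
    for (const auto& t : tables) {
        // AMAT_i = h_i * t_L3 + (1 - h_i) * t_DRAM
        double amat = t.l3_hit_rate * t_l3 + (1.0 - t.l3_hit_rate) * t_dram;
        w += t.lookups_per_batch * amat;  // per-table load W_i
    }
    return w;
}

// W_mlp: total MACs over all fully connected layers times ACMO,
// the average CPU cycles per MAC operation.
double mlp_load(const std::vector<std::pair<int, int>>& layer_dims,
                double acmo) {
    double macs = 0.0;
    for (const auto& [in_dim, out_dim] : layer_dims)
        macs += static_cast<double>(in_dim) * out_dim;
    return macs * acmo;
}

// Split N_total physical cores in proportion to the embedding load share.
std::pair<int, int> split_cores(double w_emb, double w_mlp, int n_total) {
    double rho = w_emb / (w_emb + w_mlp);  // embedding lookup load share
    int n_emb = std::clamp(static_cast<int>(std::lround(rho * n_total)),
                           1, n_total - 1);  // keep at least one core each
    return {n_emb, n_total - n_emb};         // (N_emb, N_mlp)
}
```

The clamp to at least one core per task is an assumption added for robustness; the patent's rounding rule is not recoverable from the extracted text.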

Inventors

  • Qin Te
  • Ma Yechi
  • Wang Qianlei
  • Weng Chuliang

Assignees

  • East China Normal University (华东师范大学)

Dates

Publication Date
2026-05-12
Application Date
2026-01-23

Claims (8)

  1. A load pre-calculation based dynamic optimization method for CPU-side DLRM inference, comprising two independent modules, dynamic core allocation (DCA) and dynamic prefetching (DPF), and completed according to the following steps. Step 1, dynamic core allocation, in which the DCA module executes: S1, constructing a load quantization model, comprising: S1-1, calculating the load $W_{emb}$ of the embedding lookup task; S1-2, calculating the load $W_{mlp}$ of the Bottom-MLP task; S2, load sampling and dynamic adjustment, comprising: S2-1, a cold-start sampling mechanism; S2-2, an incremental sampling mechanism during running; S2-3, dynamically allocating CPU physical core resources; S3, modifying the thread pool mechanism of the deep learning framework to realize parallel scheduling of the two types of tasks, comprising: S3-1, modifying the thread pool mechanism of the deep learning framework; S3-2, binding the thread pools of the embedding lookup task and the Bottom-MLP task respectively to the corresponding physical cores allocated in step S2; S3-3, realizing parallel scheduling of the two types of tasks. Step 2, dynamic prefetching, in which the DPF module executes: T1, establishing core design principles and defining the key control parameter for the three-level indirect addressing characteristic of embedding lookup, comprising: T1-1, establishing a timeliness matching principle and an accurate prefetching principle; T1-2, explicitly taking the prefetch distance pf_dist as the core control parameter, and adjusting pf_dist to satisfy the optimization target $pf\_dist \cdot t_{compute} \ge t_{dram}$, wherein $t_{compute}$ is the aggregation computation time of a single embedding lookup, including AVX vector addition and store operations, and $t_{dram}$ is the DRAM access time of a single embedding table row; T2, recording iteration execution periods in the core loop of the embedding_bag operator, comprising: T2-1, recording the iteration execution period; T2-2, separating the aggregation computation time $t_{compute}$ of a single embedding lookup from the DRAM access time $t_{dram}$; T2-3, sampling the parameters with a sampling period consistent with step S2; T3, performing the dynamic prefetch operation in the core loop of the embedding_bag operator, comprising: T3-1, calculating the prefetch distance; T3-2, inserting prefetch instructions.
  2. The load pre-calculation based dynamic optimization method for CPU-side DLRM inference according to claim 1, wherein the calculation of the embedding lookup task load in step S1-1 comprises: S1-1a, calculating the average pooling factor $P_i$ of the i-th embedding table as $P_i = L_i / \text{batch\_size}$, wherein $L_i$ is the total number of lookups into the i-th embedding table in a single batch and batch_size is the inference batch size; S1-1b, calculating the average memory access time $T_i$ of the i-th embedding table as $T_i = h_i \cdot t_{L3} + (1 - h_i) \cdot t_{DRAM}$, wherein $h_i$ is the L3 cache hit rate of the i-th embedding table, $t_{L3}$ is the CPU L3 cache hit latency, and $t_{DRAM}$ is the DRAM access latency; S1-1c, calculating the table load $W_i$ of the i-th embedding table as $W_i = \text{batch\_size} \cdot P_i \cdot T_i$; S1-1d, accumulating the table loads of all embedding tables to obtain the total embedding lookup load $W_{emb} = \sum_i W_i$.
  3. The load pre-calculation based dynamic optimization method for CPU-side DLRM inference according to claim 1, wherein the Bottom-MLP task load $W_{mlp}$ of step S1-2 is calculated as $W_{mlp} = \text{total\_MACs} \cdot \text{ACMO}$, wherein total_MACs is the total number of multiply-accumulate operations over all fully connected layers of the Bottom-MLP, calculated as $\text{total\_MACs} = \sum_{l=1}^{L} I_l \cdot O_l$, wherein $L$ is the number of layers of the Bottom-MLP, $I_l$ and $O_l$ are respectively the input and output dimensions of the l-th layer, and ACMO is the average number of CPU cycles for a single MAC operation.
  4. The load pre-calculation based dynamic optimization method for CPU-side DLRM inference according to claim 1, wherein the number of sampling batches in the cold-start sampling stage of step S2-1 ranges from 5 to 15, taking 10-15 in scenes where the embedding tables update frequently and 5-10 in scenes where they update infrequently; the sampling period of the incremental sampling stage during running in step S2-2 ranges from 80 to 200 batches, taking 80-120 in high-update-frequency scenes and 120-200 in low-update-frequency scenes; the update frequency of an embedding table is classified by the rule that an update proportion above a preset threshold constitutes a high-update-frequency scene and one below the threshold a low-update-frequency scene, the preset threshold being determined by combining the Service Level Agreement (SLA) latency constraint with the characteristics of the specific service scene.
  5. The load pre-calculation based dynamic optimization method for CPU-side DLRM inference according to claim 1, wherein the dynamic core allocation algorithm of step S2-3 comprises: S2-3a, obtaining the total number $N_{total}$ of available CPU physical cores; S2-3b, calculating the load share of the embedding lookup task, $\rho = W_{emb} / (W_{emb} + W_{mlp})$; S2-3c, assigning the number of cores of the embedding lookup task, $N_{emb} = \text{round}(\rho \cdot N_{total})$; S2-3d, assigning the number of cores of the Bottom-MLP task, $N_{mlp} = N_{total} - N_{emb}$; S2-3e, repeating steps S2-3b to S2-3d every sampling period of step S2-2, updating the load proportion and reallocating the cores.
  6. The load pre-calculation based dynamic optimization method for CPU-side DLRM inference according to claim 1, wherein in step S3 the deep learning framework is PyTorch + Intel Extension for PyTorch, and parallel scheduling of the embedding lookup task and the Bottom-MLP task is realized through the torch.jit.fork interface.
  7. The load pre-calculation based dynamic optimization method for CPU-side DLRM inference according to claim 1, wherein in step T2, the recording of the iteration execution period in T2-1 records the start cycle $c_{start}$ and end cycle $c_{end}$ of each embedding lookup loop iteration and calculates the total iteration period $C = c_{end} - c_{start}$; the aggregation computation time of a single embedding lookup in T2-2 takes the minimum of $C$ over all iterations of T2-1, $t_{compute} = \min_k C_k$, and the DRAM access time takes the maximum over all iterations of the excess over it, $t_{dram} = \max_k (C_k - t_{compute})$.
  8. The load pre-calculation based dynamic optimization method for CPU-side DLRM inference according to claim 1, wherein the prefetch distance of step T3-1 is obtained by calculating pf_dist from $t_{compute}$ and $t_{dram}$ of T2 according to the optimization target of T1, and the prefetch instruction of T3-2 is the _mm_prefetch instruction, which prefetches the embedding table data corresponding to the subsequent pf_dist indexes while the embedding vector aggregation computation proceeds, as illustrated in the sketch following the claims.
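The following hedged C++ sketch puts claims 1, 7, and 8 together: per-iteration cycle counting, separating $t_{compute}$ (minimum iteration period) from $t_{dram}$ (maximum excess over the minimum), and deriving the prefetch distance from the optimization target $pf\_dist \cdot t_{compute} \ge t_{dram}$. The intrinsics __rdtsc and _mm_prefetch are real x86 primitives; the function name, loop structure, and seed distance are assumptions, not the patent's actual kernel.

```cpp
#include <immintrin.h>  // _mm_prefetch, _MM_HINT_T0
#include <x86intrin.h>  // __rdtsc (GCC/Clang)
#include <algorithm>
#include <cstdint>

// Sum-pool n rows of an embedding table while timing each iteration and
// adapting the prefetch distance on the fly.
void embedding_bag_sum_dpf(const float* table, int64_t row_dim,
                           const int64_t* indices, int64_t n, float* out) {
    uint64_t t_compute = UINT64_MAX;  // T2-2: min period ~ pure aggregation
    uint64_t t_max = 0;               // slowest iteration ~ DRAM-bound path
    int64_t pf_dist = 8;              // assumed seed value for the distance

    for (int64_t k = 0; k < n; ++k) {
        uint64_t c_start = __rdtsc();  // T2-1: start cycle

        // T3-2: prefetch the row needed pf_dist iterations ahead so its
        // DRAM fetch overlaps the next pf_dist aggregations (timeliness).
        if (k + pf_dist < n)
            _mm_prefetch(reinterpret_cast<const char*>(
                             table + indices[k + pf_dist] * row_dim),
                         _MM_HINT_T0);

        const float* row = table + indices[k] * row_dim;
        for (int64_t d = 0; d < row_dim; ++d)  // aggregation (vectorizable)
            out[d] += row[d];

        uint64_t cycles = __rdtsc() - c_start;  // T2-1: iteration period
        t_compute = std::min(t_compute, cycles);
        t_max = std::max(t_max, cycles);

        // T3-1: pf_dist = ceil(t_dram / t_compute), clamped to at least 1.
        uint64_t t_dram = t_max - t_compute;
        if (t_compute > 0)
            pf_dist = std::max<int64_t>(
                1, static_cast<int64_t>((t_dram + t_compute - 1) / t_compute));
    }
}
```

Claim 1's T2-3 would sample these timings only periodically rather than on every iteration; the per-iteration update above is a simplification for brevity.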

Description

Technical Field

The invention belongs to the technical field of deep learning and computer system optimization, and in particular relates to a dynamic optimization method for Deep Learning Recommendation Model (DLRM) inference oriented to CPU architectures. The method is particularly suited to resolving the performance bottlenecks caused by unbalanced core allocation across heterogeneous tasks and by the irregular memory accesses of the embedding layer, and can be widely applied to online inference scenes of industrial recommendation systems; specifically, it relates to a load pre-calculation based dynamic optimization method for CPU-side DLRM inference.

Background

The Deep Learning Recommendation Model (DLRM) is a core component of industrial recommendation systems in fields such as e-commerce, online advertising, and media entertainment; its core function is the rapid prediction of user behavior (such as click-through rate and conversion rate). DLRM training generally relies on the massive parallel computing capability of GPUs, but online inference is characterized by small-batch data processing, sensitivity to deployment cost, and high demands on resource elasticity, so the CPU has become the mainstream hardware platform for DLRM online inference by virtue of its excellent small-batch adaptability, low upfront cost, and flexible resource scheduling.

A typical DLRM inference pass contains four core stages, the bottom multilayer perceptron (Bottom-MLP), embedding lookup (Embedding Lookup), feature interaction (Feature Interaction), and the top multilayer perceptron (Top-MLP), and exhibits pronounced heterogeneous task characteristics. As can be seen from FIG. 1, the four core stages of DLRM inference execute in sequence and differ markedly in task attributes: the Bottom-MLP, feature interaction, and Top-MLP are compute-intensive tasks whose performance depends mainly on CPU compute resources, while embedding lookup is a typical memory-intensive task that must access large-scale embedding tables through sparse indexes; its share of execution time exceeds 60%, and even reaches 98% in some scenes, making it the key bottleneck determining DLRM end-to-end inference latency.

Despite the wide application of CPUs in DLRM inference, the prior art still faces two core system-level challenges, and the existing solutions for both have significant drawbacks that fail to solve the problems fundamentally.

The first key challenge is the core allocation imbalance of heterogeneous tasks. There is no data dependence between the Bottom-MLP and the embedding lookup, so the conditions for thread-level parallel execution are present, which provides an important opportunity for improving CPU resource utilization. Existing parallel optimization strategies fall into three types, each with shortcomings that are difficult to overcome: 1. the sequential execution strategy does not exploit multi-core parallelism at all; all operators execute in sequence with extremely low efficiency, and as an early baseline scheme it has no practical industrial-grade value; 2. the model-level parallel strategy improves throughput by running multiple independent inference instances with multithreading, but it easily causes compute-compute and memory-memory resource contention, leading to cache thrashing and a limited parallel speedup, and it does not optimize the heterogeneous tasks within a single inference instance; 3. the operator-level parallel strategy (HT strategy) supports parallel execution of the Bottom-MLP and the embedding lookup, but it adopts a uniform or random fixed core allocation that ignores the inherent load difference between the two tasks. In practical scenes the embedding lookup load is often far higher than that of the Bottom-MLP; under fixed allocation the heavily loaded embedding lookup runs slowly for lack of resources while the lightly loaded Bottom-MLP finishes early and idles. Moreover, existing dynamic allocation schemes rely on heuristic rules or coarse-grained static partitions, lack a quantitative evaluation model of task load, cannot adapt to dynamically varying inference scenes, and thereby further aggravate the core allocation imbalance.

The second key challenge is that the irregular memory accesses of the embedding lookup stage cause prefetching to fail. The embedding lookup uses efficient PyTorch operators such as embedding_bag, whose three-level indirect addressing mechanism (offset array → index array → embedding table)
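To make that access pattern concrete, below is a minimal C++ sketch of an embedding_bag-style sum-pooling kernel with the three-level indirect addressing described above. The function name and signature are illustrative assumptions, not the operator's actual implementation; the point is that each address depends on a prior data load, which is what defeats stride-based hardware prefetchers.

```cpp
#include <cstdint>

// out must be zero-initialized by the caller; one output row per bag.
void embedding_bag_gather(const float* table, int64_t row_dim,
                          const int64_t* offsets, int64_t n_bags,
                          const int64_t* indices, float* out) {
    for (int64_t b = 0; b < n_bags; ++b) {
        float* dst = out + b * row_dim;
        // Level 1: offsets[b]..offsets[b+1] delimits this bag's index slice.
        for (int64_t k = offsets[b]; k < offsets[b + 1]; ++k) {
            // Level 2: a sparse, data-dependent index selects the row.
            // Level 3: the row is fetched from a table typically far larger
            // than the last-level cache, so consecutive iterations touch
            // addresses with no learnable stride.
            const float* row = table + indices[k] * row_dim;
            for (int64_t d = 0; d < row_dim; ++d)
                dst[d] += row[d];  // sum pooling
        }
    }
}
```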