CN-115455150-B - Dual tensor parallel method for mixed word frequency embedding

Abstract

The invention discloses a dual tensor parallel method for mixed word-frequency embedding, comprising the following steps. S1: a task distributor first scans the data set used for one training run and counts the number of occurrences of each queried word id; the rows of the embedding table are then partitioned evenly across the parallel devices by total word frequency using a greedy algorithm (minimax: making the maximum word-frequency difference between the partitioned embedding tables as small as possible), so that the word frequency on each device is essentially balanced. By using the word-frequency distribution of the embedding table to split it evenly by access volume during tensor parallelism, the method spreads the training load uniformly, supports the embedding-bag operation, and reduces inter-device communication cost compared with plain embedding; on each parallel device the embedding table is further partitioned by word frequency and algorithmically compressed, which further reduces per-device memory consumption.

Inventors

  • Han Jiatong
  • Lou Yuxuan
  • Wu Junming
  • Lu Guangyang
  • Chen Weiwen
  • Fang Jiarui
  • Li Shenggui
  • Bian Zhengda
  • Li Yongbin
  • Liu Hongxin
  • Mai Siqi
  • Liu Yuliang
  • Huang Haichen

Assignees

  • 北京潞晨科技有限公司 (Beijing Luchen Technology Co., Ltd.)

Dates

Publication Date
2026-05-12
Application Date
2022-09-27

Claims (2)

  1. A dual tensor parallel method for mixed word-frequency embedding, characterized by comprising the following steps: S1, first scanning the data set used for one training run with a task distributor and counting the number of occurrences of each queried word id; then partitioning the rows of the embedding table evenly across the parallel devices by total word frequency using a greedy algorithm, so that the word frequency on each device is essentially balanced; and preprocessing the input data so that each device keeps only the word ids it can look up locally, replacing word ids it cannot look up with a special value; S2, exploiting the long-tail characteristic of the frequency distribution: after sorting by frequency, splitting the embedding bag into embedding blocks by rows at different quantiles, compressing the embedding dimension of each block to a value appropriate to the number of word occurrences it contains, and expanding the final embedding dimension of each block back to a uniform dimension with a corresponding linear projector, wherein the quantiles can be 25%, 50% and 75%, and the embedding bag is an embedding data structure supporting sum, mean and max operations; S3, gathering the partitioned embedding bags with reduce and all-reduce operations, wherein for a split that occurs on a single device a reduce is performed inside that device to obtain the embedding of the required shape, and for a split that spans multiple devices an all-reduce is performed across devices, that is, each device's embedding list is transmitted to the other devices and concatenated, and a sum is then taken to obtain the required shape.
  2. The method of claim 1, characterized in that step S2 follows the principle that the lower the word frequency, the higher the compression strength.
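The S1 row-partitioning step in claim 1 can be sketched as a classic greedy load-balancing routine. This is an illustrative reconstruction, not code from the patent: `partition_rows_by_frequency` and its signature are assumptions; the greedy rule (always hand the next-heaviest row to the currently lightest device) is one standard way to realize the minimax objective.

```python
# Hypothetical sketch of step S1: greedily assign embedding-table rows
# to devices so total word frequency per device is as balanced as possible
# (minimax: minimize the maximum frequency gap between devices).
def partition_rows_by_frequency(freqs, n_devices):
    """freqs[i] is the access frequency of row i; returns (row-id sets, loads)."""
    order = sorted(range(len(freqs)), key=lambda i: -freqs[i])  # heaviest first
    loads = [0] * n_devices
    parts = [set() for _ in range(n_devices)]
    for row in order:
        dev = min(range(n_devices), key=loads.__getitem__)  # lightest device
        parts[dev].add(row)
        loads[dev] += freqs[row]
    return parts, loads

parts, loads = partition_rows_by_frequency([100, 90, 30, 25, 20, 15], 2)
# loads == [140, 140]: both devices see the same total access volume
```

Sorting heaviest-first before assigning is what keeps the final loads close; assigning in arbitrary order can leave one device with most of the high-frequency rows.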

Description

Dual tensor parallel method for mixed word frequency embedding

Technical Field

The invention relates to the technical field of deep learning, in particular to a dual tensor parallel method for mixed word-frequency embedding.

Background

With the continuous growth of user numbers and commodity types, the amount of user and commodity information to be embedded in deep-learning recommendation models grows exponentially, and in some scenarios a model can reach trillions of parameters. One existing solution distributes the embedding table across different devices for training in a tensor-parallel manner: the embedding table is first split evenly by columns across devices, the embedding lookup is performed on each device, and an all-to-all operation then collects all lookup results on every device and concatenates them along the original split dimension. Another solution uses word-frequency statistics for algorithmic compression, for example assigning high-frequency words a large embedding dimension and low-frequency words a smaller one, then linearly projecting back to a uniform embedding dimension to remove parameter redundancy. For example, if the embedding table has size (|V|, D) and high-, medium- and low-frequency words each account for |V|/3 of the rows (with the band limits computed by a chosen method), the compressed embedding dimensions can be D, D/2 and D/4, and the corresponding linear projectors have sizes 0 (no projection), (D/2, D) and (D/4, D).
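The parameter arithmetic behind the background's compression example can be checked directly. The vocabulary size and dimension below are illustrative values chosen for the sketch, not figures from the patent; the three-band split at dimensions D, D/2, D/4 follows the example above.

```python
# Worked example of the background's compression scheme, assuming three
# equal-size frequency bands of |V|/3 rows each at dims D, D/2, D/4.
V, D = 90_000, 128                               # illustrative sizes
uncompressed = V * D                             # full (|V|, D) table
compressed = (V // 3) * (D + D // 2 + D // 4)    # banded embedding rows
projectors = (D // 2) * D + (D // 4) * D         # (D/2, D) and (D/4, D)
total = compressed + projectors

# The projector parameters (12,288 here) are tiny next to the rows saved,
# so the compressed total (6,732,288) is well under the original 11,520,000.
assert total < uncompressed
```

This is the point made later in the description: the parameters added by the linear projectors are generally far fewer than the parameters removed by shrinking the embedding dimensions of low-frequency rows.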
However, the tensor-parallel implementation above does not take the distribution of word frequencies into account, so some devices may receive no embedding accesses at all and the advantage of tensor parallelism goes unused; it also supports only the plain embedding operation rather than embedding bags, and incurs a large communication cost between devices.

Disclosure of the Invention

(I) Technical problem solved

Addressing the defects of the prior art, the invention provides a dual tensor parallel method for mixed word-frequency embedding, which solves the problems that the word-frequency distribution is not considered, that some devices may receive no embedding accesses so the advantage of tensor parallelism is not exploited, that only the embedding operation rather than embedding bags is supported, and that inter-device communication is expensive.

(II) Technical solution

The invention is realized by the following technical solution. The dual tensor parallel method for mixed word-frequency embedding specifically comprises the following steps:

S1, first scanning the data set used for one training run with a task distributor and counting the number of occurrences of each queried word id; then partitioning the rows of the embedding table evenly across the parallel devices by total word frequency using a greedy algorithm (minimax: making the maximum word-frequency difference between the partitioned embedding tables as small as possible), so that the word frequency on each device is essentially balanced; and preprocessing the input data so that each device keeps only the word ids it can look up locally, replacing word ids it cannot look up with a special value.

S2, exploiting the long-tail characteristic of the frequency distribution: after sorting by frequency, splitting the embedding bag into embedding blocks by rows at different quantiles (such as 25%, 50%, 75%), compressing the embedding dimension of each block to a value appropriate to the number of word occurrences it contains, and expanding the final embedding dimension of each block back to a uniform dimension with a corresponding linear projector. Algorithmic compression is achieved because the parameters added by the linear projectors are generally far fewer than the parameters saved by compressing the embedding dimensions.

S3, the dual tensor parallelism is mainly embodied in the two splits of the embedding table, and two corresponding gather operations are needed to finally collect the partitioned embedding bag. The gather operation used here is (all-)reduce, whose main function is to aggregate the partitioned embedding tables in a specified mode (such as sum, max or mean). Because these modes coincide with the reduction modes of the embedding bag itself (sum, max or mean), the communication requirement between devices is reduced from the O(b·n·d) of the original representation to O(b·d), and the method guarantees that the gathered result is identical to single-device processing.

Preferably, in step S2, the rule that the lower the word frequency, the higher the compression strength is followed.
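The S3 gather can be illustrated with a small single-process simulation. This is an assumed NumPy stand-in for the distributed operation, not the patent's implementation: each "device" holds one row shard, looks up only the ids it owns (ids replaced by the special value in S1 contribute nothing), and summing the per-device partial bags plays the role of the all-reduce.

```python
import numpy as np

# Simulated S3 gather for an embedding bag in "sum" mode: per-device partial
# lookups followed by an all-reduce (sum) reproduce the unsharded result.
rng = np.random.default_rng(0)
table = rng.normal(size=(8, 4))                 # full embedding table, 8 rows
shards = {0: {0, 1, 2, 3}, 1: {4, 5, 6, 7}}     # row ids owned per device

def local_bag_sum(query_ids, owned):
    # Each device sums only the rows it owns; ids it cannot look up are
    # skipped, i.e. they contribute a zero vector to the partial bag.
    return sum((table[i] for i in query_ids if i in owned),
               start=np.zeros(table.shape[1]))

query = [1, 5, 6]                               # one bag spanning both shards
partials = [local_bag_sum(query, owned) for owned in shards.values()]
all_reduced = np.sum(partials, axis=0)          # stands in for all-reduce
single_device = table[query].sum(axis=0)        # reference: no sharding
assert np.allclose(all_reduced, single_device)
```

Because sum (and likewise max and mean, with a count correction) commutes with the all-reduce mode, each device only needs to exchange one d-dimensional partial per bag instead of all n looked-up rows, which is the O(b·n·d) → O(b·d) communication saving described above.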