CN-122021733-A - Distributed multi-mode training reasoning method, system and medium for heterogeneous nodes

CN122021733A

Abstract

The invention relates to the technical field of distributed computing and discloses a distributed multi-modal training and reasoning method, system, and medium for heterogeneous nodes. The method comprises: analyzing the distributed deployment topology of a multi-modal model, acquiring the connection relations among modal branches, and generating a modal interaction dependency graph; applying the unified precision configuration of each precision cooperative group at the input end of the cross-modal interaction layer, configuring a precision alignment operator, and outputting precision-aligned interaction input data; and acquiring the hardware precision capability of each heterogeneous node in the reasoning stage, generating a training-reasoning precision migration mapping table, and, based on that table, generating the precision configuration and distributed reasoning scheduling scheme for the reasoning stage. The invention has the technical effects of ensuring numerical stability during model training and improving convergence quality.

Inventors

  • SUN ZHIMING
  • CHEN HAITAO
  • LEI TONG
  • WANG BIN

Assignees

  • 南京汇智互娱网络科技有限公司

Dates

Publication Date
2026-05-12
Application Date
2026-01-13

Claims (10)

  1. A distributed multi-mode training reasoning method for heterogeneous nodes, characterized by comprising the following steps: analyzing the distributed deployment topology of the multi-modal model, and obtaining the connection relations between modal branches to generate a modal interaction dependency graph; based on the modal interaction dependency graph, classifying the modal branches having direct interactive connections into the same precision cooperative group and, in combination with the hardware precision capability parameters of each heterogeneous node, determining a unified precision configuration for each cooperative group under a precision compatibility constraint to generate a cooperative precision allocation scheme; applying the unified precision configuration of the precision cooperative group at the input end of the cross-modal interaction layer, configuring a precision alignment operator, performing scaling and format-conversion operations on the feature data from the different modal branches, and outputting precision-aligned interaction input data; performing distributed training of the multi-modal model according to the cooperative precision allocation scheme, each heterogeneous node computing under the precision configuration of its cooperative group and the cross-modal interaction layer performing feature fusion through the precision alignment operator, and iteratively updating the model parameters until convergence; acquiring the hardware precision capability of each heterogeneous node in the reasoning stage, performing mapping analysis between the training-stage cooperative precision allocation scheme and the precision capabilities of the reasoning nodes, and generating a training-reasoning precision migration mapping table; and, based on the training-reasoning precision migration mapping table, executing weight format conversion and scaling-factor recalculation on the modal branches that need precision adjustment, to generate the precision configuration and distributed reasoning scheduling scheme for the reasoning stage.
  2. The method according to claim 1, wherein the precision alignment operator calculates its scaling factor by dynamic quantization: for an input feature tensor, the scaling factor is obtained by dividing the maximum absolute value of all elements in the tensor by the maximum positive value representable in the target precision format; the scaled tensor is obtained by dividing the original tensor by the scaling factor; and after the interaction computation is complete, the original numerical range is restored by multiplying by the scaling factor.
  3. The method according to claim 1, wherein classifying the modal branches having direct interactive connections into the same precision cooperative group comprises: extracting all edge-connected modal branch pairs from the modal interaction dependency graph; merging modal branches having transitive dependency relations into the same cooperative group using a union-find algorithm; for each cooperative group, collecting the hardware precision capability parameters of the nodes on which each modal branch in the group is deployed; taking the intersection of the precisions supported by those nodes as the group's selectable precision set; and selecting from the selectable precision set, as the group's unified precision configuration, the precision format with the best computational efficiency that meets the model's accuracy requirements.
  4. The method of claim 1, further comprising performing joint scheduling analysis on the multi-stage pipeline topology to identify, for each stage of the intra- and inter-modality pipelines, the bubble locations, bubble durations, and the computing resources available during bubble periods, and to generate a bubble distribution profile and a bubble resource inventory.
  5. The method of claim 4, further comprising extracting the precision format conversion operations to be performed from a precision switching plan, estimating the computational cost of each conversion operation, matching conversion tasks against bubble resources, preferentially allocating each conversion task to a bubble period whose duration and resources can accommodate it, and executing it there to generate a bubble-filling scheduling scheme, wherein the computational cost of a conversion operation is the number of elements in the tensor to be converted multiplied by the conversion operations per element, and its execution time is the computational cost divided by the computing power of the available resource.
  6. The method according to claim 5, wherein a conversion task whose computational cost exceeds the capacity of a single bubble is split into a plurality of subtasks executed across a plurality of consecutive bubble periods, and a hierarchical buffer is configured on each pipeline-stage node to allocate independent buffer spaces for intra-modality and inter-modality data flows.
  7. The method according to claim 6, wherein the hierarchical buffer is configured as a two-level structure on each pipeline node: the first level is an intra-modality buffer storing intermediate results of the local modality's pipeline, the second level is an inter-modality buffer storing cross-modal interaction input and output data, and the two levels occupy independent address spaces.
  8. The method of claim 5, wherein the bubble-filling scheduling scheme coordinates the computation start time of each heterogeneous node, triggers the corresponding precision conversion task when pipeline execution enters a bubble period, writes the converted data into the standby area of a double buffer, and, when the next stage's computation needs the data, reads it from the standby area and switches the buffer roles.
  9. The method according to claim 8, wherein the double buffer comprises two storage spaces, a main area and a standby area; input data is read from the main area during the current computation stage; the precision conversion task writes its results into the standby area; and when the conversion task completes and the next stage's computation is about to start, the roles of the main and standby areas are interchanged, the original standby area becoming the new main area for reads in the next stage.
  10. A distributed multi-modal training reasoning system of heterogeneous nodes for performing the method as claimed in any of claims 1-9, comprising: a topology analysis module for analyzing the distributed deployment topology of the multi-modal model, acquiring the connection relations among the modal branches, and generating a modal interaction dependency graph; a precision cooperative allocation module for classifying the modal branches having direct interactive connections into the same precision cooperative group based on the modal interaction dependency graph, and determining a unified precision configuration for each cooperative group in combination with the hardware precision capability parameters of each heterogeneous node; a precision alignment module for configuring a precision alignment operator at the input end of the cross-modal interaction layer and performing scaling and format-conversion operations on the feature data from the different modal branches; a distributed training module for performing distributed training of the multi-modal model according to the cooperative precision allocation scheme; a precision migration mapping module for performing mapping analysis between the training-stage cooperative precision allocation scheme and the precision capabilities of the reasoning nodes to generate a training-reasoning precision migration mapping table; and a reasoning adaptation module for executing weight format conversion and scaling-factor recalculation based on the training-reasoning precision migration mapping table to generate the precision configuration and distributed reasoning scheduling scheme for the reasoning stage.
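The dynamic-quantization rule in claim 2 can be sketched in a few lines of Python. This is an illustrative reading of the claim, not code from the patent; the FP16 maximum and the sample feature values are assumptions.

```python
# Sketch of the claim-2 precision alignment operator (illustrative, not from the patent).
FP16_MAX = 65504.0  # largest finite value representable in IEEE 754 binary16

def align_precision(tensor, target_max=FP16_MAX):
    """Scale a feature tensor into the target format's representable range.

    Scaling factor = max |element| / max positive value of the target format;
    scaled tensor = original tensor / scaling factor (per the claim).
    """
    abs_max = max(abs(v) for v in tensor)
    scale = abs_max / target_max if abs_max > 0 else 1.0
    scaled = [v / scale for v in tensor]
    return scaled, scale

def restore(scaled, scale):
    # After the cross-modal interaction completes, multiplying by the
    # scaling factor restores the original numerical range.
    return [v * scale for v in scaled]

features = [1.2e5, -3.4e4, 7.8e2]          # hypothetical FP32 features, out of FP16 range
scaled, s = align_precision(features)
assert max(abs(v) for v in scaled) <= FP16_MAX
```

Note that this sketch scales per tensor; a production operator would also round into the target format and handle subnormals.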
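Claim 3's grouping step can be illustrated with a small union-find pass. The branch names, per-node precision sets, and the efficiency preference order below are invented for the example; only the algorithmic shape (merge edge-connected branches, intersect supported precisions, pick one format per group) comes from the claim.

```python
# Union-find grouping of modal branches into precision cooperative groups (claim 3 sketch).
def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(parent, a, b):
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra

# Hypothetical modal interaction dependency graph and node capabilities.
edges = [("vision", "language"), ("language", "audio")]
node_precisions = {
    "vision":   {"FP32", "FP16", "BF16"},
    "language": {"FP32", "BF16"},
    "audio":    {"FP32", "BF16", "FP16"},
}

parent = {b: b for b in node_precisions}
for a, b in edges:                     # branches with direct interaction merge
    union(parent, a, b)                # (transitive dependencies land in one group)

groups = {}
for b in node_precisions:
    groups.setdefault(find(parent, b), []).append(b)

preference = ["BF16", "FP16", "FP32"]  # assumed efficiency order, cheapest first
for root, members in groups.items():
    candidates = set.intersection(*(node_precisions[m] for m in members))
    chosen = next(p for p in preference if p in candidates)
```

With these inputs all three branches collapse into a single cooperative group whose selectable set is the intersection {FP32, BF16}.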
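The cost model in claims 5-6 (cost = elements x ops per element, time = cost / available compute; tasks that fit no single bubble are split) can be exercised with made-up numbers. Task sizes, bubble durations, and FLOP rates below are illustrative, and this greedy match does not decrement a bubble's remaining capacity, which a real scheduler would.

```python
# Bubble-filling match per the claim-5 cost model (illustrative sketch).
def conversion_time(num_elements, ops_per_element, flops):
    """Execution time = (elements x per-element conversion ops) / computing power."""
    return num_elements * ops_per_element / flops

tasks = [  # (name, tensor elements, conversion ops per element) - hypothetical
    ("vision_fp32_to_bf16", 20_000_000, 2),
    ("audio_fp16_to_fp32", 500_000, 1),
]
bubbles = [  # (stage, bubble duration in seconds, FLOP/s free during the bubble)
    ("stage0", 0.002, 5e9),
    ("stage1", 0.0005, 5e9),
]

schedule = []
for name, n, ops in tasks:
    for stage, dur, flops in bubbles:
        t = conversion_time(n, ops, flops)
        if t <= dur:                       # bubble can accommodate the whole task
            schedule.append((name, stage, t))
            break
    else:
        # Claim-6 fallback: too large for any single bubble, so split the task
        # into subtasks executed across consecutive bubble periods.
        schedule.append((name, "split_across_bubbles", None))
```

Here the vision conversion (0.008 s of work) exceeds every bubble and is marked for splitting, while the audio conversion fits the first bubble.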
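The main/standby role swap of claims 8-9 amounts to classic double buffering. The class below is a minimal sketch of that protocol; the class name and list-based storage are illustrative, not from the patent.

```python
# Double-buffer role swap per claims 8-9 (minimal illustrative sketch).
class DoubleBuffer:
    """Two storage spaces per pipeline node: a main area read by the current
    computation stage and a standby area written by the precision-conversion task."""

    def __init__(self):
        self.main = []     # current stage reads its input from here
        self.standby = []  # conversion task writes converted data here

    def write_converted(self, data):
        self.standby = list(data)

    def read(self):
        return list(self.main)

    def swap(self):
        # When conversion completes and the next stage is about to start, the
        # roles interchange: the original standby becomes the new main area.
        self.main, self.standby = self.standby, self.main

buf = DoubleBuffer()
buf.write_converted([0.5, 0.25])  # converted tensor lands in the standby area
buf.swap()                        # roles interchange before the next stage
assert buf.read() == [0.5, 0.25]
```

Because reads and conversion writes touch disjoint areas, the conversion can run inside a pipeline bubble without stalling the stage that is reading.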

Description

Distributed multi-mode training reasoning method, system and medium for heterogeneous nodes

Technical Field

The invention relates to the technical field of distributed computing, and in particular to a distributed multi-modal training and reasoning method, system, and medium for heterogeneous nodes.

Background

In distributed training and reasoning scenarios for very large multimodal models, the model consists of multiple branches that process data of different modalities, such as visual, language, and audio branches, which require feature interaction and fusion at the cross-modal attention layers. To improve computational efficiency, heterogeneous node clusters are used for distributed computation in actual deployments; different nodes are equipped with different types of processors and accelerators and support different numerical precision formats. In the prior art, an independent mixed-precision configuration strategy is adopted: each modal branch selects a precision format such as FP32, FP16, or BF16 according to its own computational characteristics. At reasoning deployment, the precision configuration scheme determined during training is migrated directly to the reasoning nodes. Distributed training employs multi-stage pipeline execution, including intra-modality and inter-modality pipelines.
However, the prior art has the following technical problems. First, when modal branches with different precision configurations undergo feature fusion at the cross-modal interaction layer, differences in numerical range and representational precision cause numerical anomalies, which appear as gradient explosion, underflow, or precision loss and degrade model convergence quality. Second, when the precision configuration scheme determined during training is migrated to the reasoning stage, it often does not match the hardware precision capabilities of the heterogeneous nodes used in reasoning deployment, causing reasoning failures or the extra overhead of full-precision conversion. Third, bubble periods arise during multi-stage pipeline execution; if the two levels of pipeline are scheduled independently, these bubble periods can overlap, leaving computing resources completely idle and wasted during those periods.

Disclosure of Invention

The invention provides a distributed multi-modal training and reasoning method, system, and medium for heterogeneous nodes, which solve the technical problems in the related art of numerical anomalies at the multi-modal interaction layer, training-reasoning precision migration mismatch, and wasted resources during pipeline bubble periods.
The invention provides a distributed multi-modal training and reasoning method for heterogeneous nodes, comprising the following steps: analyzing the distributed deployment topology of the multi-modal model, and obtaining the connection relations between modal branches to generate a modal interaction dependency graph; based on the modal interaction dependency graph, classifying the modal branches having direct interactive connections into the same precision cooperative group and, in combination with the hardware precision capability parameters of each heterogeneous node, determining a unified precision configuration for each cooperative group under a precision compatibility constraint to generate a cooperative precision allocation scheme; applying the unified precision configuration of the precision cooperative group at the input end of the cross-modal interaction layer, configuring a precision alignment operator, performing scaling and format-conversion operations on the feature data from the different modal branches, and outputting precision-aligned interaction input data; performing distributed training of the multi-modal model according to the cooperative precision allocation scheme, each heterogeneous node computing under the precision configuration of its cooperative group and the cross-modal interaction layer performing feature fusion through the precision alignment operator, and iteratively updating the model parameters until convergence; acquiring the hardware precision capability of each heterogeneous node in the reasoning stage, performing mapping analysis between the training-stage cooperative precision allocation scheme and the precision capabilities of the reasoning nodes, and generating a training-reasoning precision migration mapping table; and, based on the training-reasoning precision migration mapping table, executing weight format conversion and scaling-factor recalculation on the modal branches that need precision adjustment, to generate the precision configuration and distributed reasoning scheduling scheme for the reasoning stage. Further, the precision alignment opera