CN-121809550-B - Reasoning efficiency optimization method for an intelligent middle-platform fusion large model
Abstract
The invention discloses a reasoning efficiency optimization method for an intelligent middle-platform fusion large model, relating to the technical field of artificial intelligence. The method comprises: converting a sequence to be processed into a symbolized characterization stream, analyzing the distribution characteristics of the symbolized characterization stream to determine a task complexity attribute, and generating a task package to be inferred; matching the task package to be inferred against a load threshold and, when a preset degradation condition is met, reconstructing a logic gate execution sequence and invoking a narrow-bandwidth path in a multidimensional heterogeneous reasoning space to generate a differential execution instruction stream; running the differential execution instruction stream to obtain an initial reasoning result, and monitoring the semantic drift state of the initial reasoning result relative to an expected logic trajectory; and, when the semantic drift state is judged to trigger a correction mechanism, locating a deviation point and extracting a compensation rescheduling instruction carrying an operation snapshot.
Inventors
- YU LONG
- WANG HAIWEI
- LI RUIHENG
- HAN XIANGSONG
- LI MEIYING
Assignees
- 政和科技股份有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20260309
Claims (10)
- 1. A reasoning efficiency optimization method for an intelligent middle-platform fusion large model, characterized by comprising the following steps: identifying a core association region of the fusion large model and compressing a non-core region to form a multidimensional heterogeneous reasoning space; converting a sequence to be processed into a symbolized characterization stream, analyzing the distribution characteristics of the symbolized characterization stream to determine a task complexity attribute, and generating a task package to be inferred; matching the task package to be inferred against a load threshold and, when a preset degradation condition is met, reconstructing a logic gate execution sequence and invoking a narrow-bandwidth path in the multidimensional heterogeneous reasoning space to generate a differential execution instruction stream; running the differential execution instruction stream to obtain an initial reasoning result, and monitoring the semantic drift state of the initial reasoning result relative to an expected logic trajectory; when the semantic drift state is judged to trigger a correction mechanism, locating a deviation point and extracting a compensation rescheduling instruction carrying an operation snapshot; and, in response to the compensation rescheduling instruction, invoking high-fidelity resources according to the deviation point, performing directed supplementary reasoning on the operation snapshot, outputting a reasoning result, and adjusting the generation logic of the task package to be inferred based on correction feedback.
- 2. The reasoning efficiency optimization method of the intelligent middle-platform fusion large model according to claim 1, wherein the method for forming the multidimensional heterogeneous reasoning space comprises: acquiring the norm distribution and gradient sensitivity index of each layer of the fusion large model and constructing an importance distribution map, and dividing the core association region and the non-core region according to the relative sensitivity relation of each layer in the importance distribution map; performing structural pruning of connection importance on the non-core region, removing connection relations that do not participate in the main reasoning path, to form a sparse connection structure; and, based on the sparse connection structure, representing the parameters of the core association region at a first parameter precision level, representing the parameters of the non-core region at a second parameter precision level lower than the first, and establishing a mapping relation between parameter precision level and computational load to form the multidimensional heterogeneous reasoning space.
- 3. The reasoning efficiency optimization method of the intelligent middle-platform fusion large model according to claim 2, wherein converting the sequence to be processed into a symbolized characterization stream comprises: performing standardized cleaning on the sequence to be processed and removing characters that do not conform to the coding specification; disassembling the sequence to be processed into minimal semantic units by identifying the association strength among characters; and mapping the minimal semantic units to corresponding integer indexes and arranging the integer indexes in the original order to generate the symbolized characterization stream.
- 4. The reasoning efficiency optimization method of the intelligent middle-platform fusion large model according to claim 3, wherein analyzing the distribution characteristics of the symbolized characterization stream to determine the task complexity attribute comprises: performing linear mapping on the symbolized characterization stream to obtain an activation distribution state, and converting the activation distribution state into an activation probability distribution; analyzing the semantic entropy value of the symbolized characterization stream according to the activation probability distribution, projecting the semantic entropy value into a dynamically updated task complexity distribution space, and determining the task complexity attribute of the symbolized characterization stream according to the relative deviation position of the semantic entropy value in that space; and extracting the semantic anchor index of the core association region, and packaging the semantic anchor index, the task complexity attribute, and the symbolized characterization stream to generate the task package to be inferred.
- 5. The reasoning efficiency optimization method of the intelligent middle-platform fusion large model according to claim 4, wherein matching the task package to be inferred against the load threshold comprises: mapping the resource allocation requirement of the multidimensional heterogeneous reasoning space based on the task complexity attribute of the task package to be inferred, and generating a real-time predicted resource load value for the task package; monitoring the available resource margin of the multidimensional heterogeneous reasoning space as the load threshold, comparing the real-time predicted resource load value with the load threshold, and determining the dynamic matching result as overload when the predicted value exceeds the threshold; and, when the dynamic matching result is overload and the task complexity attribute belongs to a preset regulation level, judging that the preset degradation condition is met and activating a trigger signal for reconstructing the logic gate execution sequence.
- 6. The reasoning efficiency optimization method of the intelligent middle-platform fusion large model according to claim 5, wherein the method for generating the differential execution instruction stream comprises: in response to the trigger signal for reconstructing the logic gate execution sequence, switching the full operator links in the fusion large model to simplified operator links; blocking access requests to redundant operators in the non-core region through the reconstructed logic gate, and establishing the execution priority of key operators in the core association region; according to the semantic anchor index in the task package to be inferred, retrieving low-bit parameters at the second parameter precision level from the multidimensional heterogeneous reasoning space, loading them into the sparse connection structure, and constructing a narrow-bandwidth path with a low-throughput characteristic; mapping the core feature stream corresponding to the semantic anchor index to the key operator execution path, and mapping the remaining non-core feature stream to the narrow-bandwidth path that retrieves the low-bit parameters; and merging the key operator execution path with the narrow-bandwidth path, rearranging the execution order of the computation instructions according to the reasoning time sequence, and generating a differential execution instruction stream composed of core-region high-fidelity instructions and non-core-region compression instructions.
- 7. The reasoning efficiency optimization method of the intelligent middle-platform fusion large model according to claim 6, wherein the method for monitoring the semantic drift state of the initial reasoning result relative to the expected logic trajectory comprises: running the core-region high-fidelity instructions and non-core-region compression instructions in the differential execution instruction stream and outputting an initial reasoning result; synchronously mapping the theoretical distribution trend of the sequence to be processed in the hidden-layer space according to the semantic anchor index to form the expected logic trajectory; and extracting the actual feature vector of the initial reasoning result, analyzing the geometric distance between the actual feature vector and the expected logic trajectory in Euclidean space, and monitoring the change amplitude of the geometric distance to generate the semantic drift state.
- 8. The reasoning efficiency optimization method of the intelligent middle-platform fusion large model according to claim 7, wherein the method for locating the deviation point and extracting the compensation rescheduling instruction carrying the operation snapshot comprises: triggering the correction mechanism when the change amplitude of the geometric distance exceeds a preset logic tolerance threshold, backtracking the running time axis of the differential execution instruction stream, and locking the computation period in which the geometric distance first changes abruptly as the deviation interval; comparing the distribution difference between the actual feature vector and the expected logic trajectory within the deviation interval, determining the token position with the largest deviation amplitude as the deviation point, and retrieving the layer index and operator number corresponding to the deviation point to establish a coordinate mapping; and directionally extracting the intermediate variables and context cache at the trigger moment of the deviation point according to the coordinate mapping to form the operation snapshot, and generating the compensation rescheduling instruction in combination with the layer index.
- 9. The reasoning efficiency optimization method of the intelligent middle-platform fusion large model according to claim 8, wherein performing directed supplementary reasoning on the operation snapshot comprises: parsing the layer index and the operation snapshot in the compensation rescheduling instruction, and invoking high-fidelity resources at the first parameter precision level in the multidimensional heterogeneous reasoning space according to the layer index; loading the intermediate variables and context cache in the operation snapshot into the computing node where the deviation point is located, and starting directed supplementary reasoning from the deviation point using the high-fidelity resources to generate a correction feature vector; and correcting the abnormal feature distribution in the initial reasoning result with the correction feature vector, completing sequence decoding, and outputting the reasoning result.
- 10. The reasoning efficiency optimization method of the intelligent middle-platform fusion large model according to claim 9, wherein adjusting the generation logic of the task package to be inferred based on correction feedback comprises: comparing the correction feature vector with the abnormal feature distribution in the initial reasoning result to obtain correction feedback information, and associating the correction feedback information with the layer index of the deviation point; identifying, according to the correction feedback information, the semantic entropy threshold that triggers the semantic drift state, and adjusting the division boundary for the task complexity attribute in the dynamically updated task complexity distribution space according to that threshold; and correcting the judgment parameters for analyzing the distribution characteristics of the symbolized characterization stream according to the adjusted division boundary, completing the iterative adjustment of the generation logic of the task package to be inferred.
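The core/non-core partition of claim 2 can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the combined norm-times-gradient-sensitivity score and the fixed core fraction are stand-ins; the claim does not specify a scoring rule.

```python
import numpy as np

def partition_layers(layer_norms, grad_sensitivity, core_fraction=0.5):
    """Rank layers by a combined importance score (norm x gradient
    sensitivity, an assumed scoring rule) and split them into a core
    association set and a non-core set."""
    scores = np.asarray(layer_norms, dtype=float) * np.asarray(grad_sensitivity, dtype=float)
    order = np.argsort(scores)[::-1]                 # most important layers first
    k = max(1, int(len(scores) * core_fraction))     # size of the core set
    core = set(order[:k].tolist())
    non_core = set(range(len(scores))) - core
    return core, non_core
```

In a full pipeline the core set would keep the first (higher) parameter precision level, while the non-core set is pruned and stored at the lower precision level.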
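Claim 3's disassembly into minimal semantic units can be sketched as a greedy longest-match over a vocabulary; the toy vocabulary and the greedy rule are assumptions, not the claimed association-strength analysis itself.

```python
def symbolize(text, vocab):
    """Greedily split `text` into the longest units present in `vocab`
    (a stand-in for the claimed association-strength segmentation) and
    map each unit to its integer index, preserving the original order."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest candidate unit first
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            i += 1                          # drop characters outside the coding spec
    return ids
```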
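The semantic-entropy analysis of claim 4 reduces, in its simplest form, to Shannon entropy over a probability distribution derived from the characterization stream. A minimal sketch; the empirical token distribution and the band cutoffs are illustrative assumptions.

```python
import math
from collections import Counter

def semantic_entropy(token_ids):
    """Shannon entropy (bits) of the empirical token-index distribution,
    a simple stand-in for the claimed activation-probability entropy."""
    n = len(token_ids)
    return -sum((c / n) * math.log2(c / n) for c in Counter(token_ids).values())

def complexity_attribute(entropy, low=1.0, high=3.0):
    """Project the entropy value onto coarse complexity bands
    (the cutoffs 1.0 and 3.0 are hypothetical, not from the claims)."""
    if entropy < low:
        return "simple"
    return "moderate" if entropy < high else "complex"
```

In the claimed method the band boundaries live in a dynamically updated task complexity distribution space, which is exactly what claim 10's correction feedback adjusts.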
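Claim 5's degradation decision is a two-condition gate: the predicted load must exceed the available margin and the task's complexity must fall inside a preset regulation band. A minimal sketch, with the regulation band as an assumption:

```python
def should_degrade(predicted_load, available_margin, complexity,
                   regulated_levels=("simple", "moderate")):
    """Return True only when the task both overloads the reasoning space
    and belongs to a regulation level that permits degradation
    (the band of degradable levels is an assumption)."""
    overloaded = predicted_load > available_margin   # the dynamic matching result
    return overloaded and complexity in regulated_levels
```

Gating on complexity prevents the very case the background warns about: forcing a complex semantic task onto the lightweight path.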
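The drift monitor of claim 7 tracks the Euclidean distance between the actual feature vector and the expected trajectory and watches its change amplitude. A per-step sketch; the tolerance value is an assumed parameter.

```python
import math

def drift_step(actual_vec, expected_vec, prev_distance, tolerance=0.5):
    """One monitoring step: compute the Euclidean distance to the expected
    logic trajectory and flag drift when the distance jumps by more than
    `tolerance` since the previous step (the tolerance is an assumption)."""
    distance = math.dist(actual_vec, expected_vec)
    drifted = abs(distance - prev_distance) > tolerance
    return distance, drifted
```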
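Claims 8 and 9 backtrack the distance series to the first abrupt jump, then bundle the intermediates at that step into a snapshot for high-fidelity re-execution. A sketch under the assumption that per-step distances and intermediates have been recorded (the field names are illustrative):

```python
def locate_deviation(distances, tolerance):
    """Backtrack the run: return the first step whose distance jump
    exceeds `tolerance`, or None when no abrupt change occurred."""
    for t in range(1, len(distances)):
        if abs(distances[t] - distances[t - 1]) > tolerance:
            return t
    return None

def compensation_instruction(step, intermediates, context_cache, layer_index):
    """Package the operation snapshot (intermediate variables plus context
    cache at the deviation step) with the layer index for rescheduling."""
    snapshot = {"step": step,
                "intermediate": intermediates[step],
                "context": list(context_cache)}
    return {"layer_index": layer_index, "snapshot": snapshot}
```

Claim 9's consumer would load this snapshot into the node holding the deviation point and resume from there with first-precision-level resources.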
Description
Reasoning efficiency optimization method for intelligent middle-platform fusion large model
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a reasoning efficiency optimization method for an intelligent middle-platform fusion large model.
Background
The artificial intelligence middle platform serves as the core infrastructure through which enterprises build large-scale, intensive AI capability: general technical capabilities are packaged as components and services, and integrating a fusion large model has become a necessary choice for raising an enterprise's level of cognitive intelligence. Related technologies mainly achieve reasoning acceleration through static means such as model quantization, pruning, and operator fusion. For example, weight bit alignment or sparsification is applied to computation-intensive core modules such as the QKV projections and FFNs, combined with dynamic computation-graph reconstruction and a variable-length request fusion scheduling mechanism in the inference execution engine, aiming to construct an efficient execution link under a specific instruction set. However, a deep contradiction remains between the inherently computation-intensive nature of the large-model reasoning phase and the platform's requirements for stability and economy, and the prior art is limited in the flexibility with which it allocates computing resources. Specifically, existing statically compressed fusion large models often adopt one-size-fits-all execution logic and can hardly achieve elastic scaling of computing power according to the semantic complexity of different tasks.
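The static compression the background refers to (pruning plus low-bit quantization) can be sketched in a few lines; the 50% sparsity level and the symmetric int8 scheme below are illustrative choices, not the patent's method.

```python
import numpy as np

def prune_and_quantize(weights, sparsity=0.5):
    """Zero the smallest-magnitude weights (unstructured pruning), then
    quantize the survivors to symmetric int8. The sparsity level and the
    int8 format are assumptions for illustration."""
    cutoff = np.quantile(np.abs(weights), sparsity)     # magnitude threshold
    pruned = np.where(np.abs(weights) < cutoff, 0.0, weights)
    max_abs = float(np.max(np.abs(pruned)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0     # dequantize as q * scale
    q = np.round(pruned / scale).astype(np.int8)
    return q, scale
```

Applied once, offline, such a scheme is exactly the "one-size-fits-all" execution logic criticized above: every request runs through the same compressed weights regardless of its semantic complexity.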
Because of the lack of an end-to-end cooperative mechanism covering the model layer, the framework layer, and the infrastructure layer, when execution is forcibly switched to a lightweight path for complex semantic logic, the reasoning result easily deviates from the expected logic trajectory for want of effective real-time monitoring and dynamic computation compensation. This rigid reasoning mode, lacking fine-grained state feedback, creates a mismatch between semantic expression precision and reasoning execution efficiency that is difficult to reconcile, limits optimal resource scheduling for an intelligent middle platform in large-scale, multi-task concurrency scenarios, and makes it difficult to guarantee the semantic high fidelity of the output of complex logic tasks.
Disclosure of Invention
The present invention has been made in view of the above problems in the prior art. The invention therefore provides a reasoning efficiency optimization method for an intelligent middle-platform fusion large model, which addresses the difficulty of balancing reasoning efficiency and semantic accuracy caused by the lack of dynamic resource scheduling and precision compensation mechanisms.
In order to solve the above technical problems, the invention provides the following technical scheme. In a first aspect, the invention provides a reasoning efficiency optimization method for an intelligent middle-platform fusion large model, which comprises: identifying a core association region of the fusion large model and compressing a non-core region to form a multidimensional heterogeneous reasoning space; converting a sequence to be processed into a symbolized characterization stream, analyzing the distribution characteristics of the symbolized characterization stream to determine a task complexity attribute, and generating a task package to be inferred; matching the task package to be inferred against a load threshold and, when a preset degradation condition is met, reconstructing a logic gate execution sequence and invoking a narrow-bandwidth path in the multidimensional heterogeneous reasoning space to generate a differential execution instruction stream; when the semantic drift state is judged to trigger a correction mechanism, locating a deviation point and extracting a compensation rescheduling instruction carrying an operation snapshot; and, in response to the compensation rescheduling instruction, invoking high-fidelity resources according to the deviation point, performing directed supplementary reasoning on the operation snapshot, outputting a reasoning result, and adjusting the generation logic of the task package to be inferred based on correction feedback. Preferably, the method for forming the multidimensional heterogeneous reasoning space comprises: acquiring the norm distribution and gradient sensitivity index of each layer of the fusion large model and constructing an importance distribution map, and dividing the core association region and the non-core region according to the relative sensitivity relation of each layer in the importance distribution map; structural pruning of connection importance is carried out on the n