
CN-122021864-A - Multi-model routing method, device and system oriented to language model reasoning

CN 122021864 A

Abstract

The invention discloses a multi-model routing method, device and system for language model reasoning, belonging to the technical field of artificial-intelligence natural language processing. The method comprises: step 1, deploying a server and clients; step 2, the server receiving a task request and classifying it; step 3, comparing the task classification probability with a confidence threshold and routing the user request to the corresponding candidate model; step 4, when the candidate model is a subject-answer model or a code-generation model, encoding the request into a semantic vector with the model, inputting the semantic vector to a trained difficulty estimator, and deciding whether to enable deep thinking according to the answer-correctness probability obtained from the trained difficulty estimator; and step 5, calling the corresponding model to respond to the user request according to the task classification and the deep-thinking decision, and returning the model response to the user.
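The routing pipeline summarized above (classify the request, compare the top class probability against a confidence threshold, then gate deep thinking on a predicted correctness probability P_correct versus a constraint threshold T) can be sketched roughly as follows. The model names, the threshold values, and the fallback rule for low-confidence classifications are illustrative assumptions, not values taken from the patent.

```python
import math

# Candidate models per the abstract and claim 3 (names are illustrative).
TASK_MODELS = ("casual_consultation", "creative_writing",
               "subject_answer", "code_generation")
CONFIDENCE_THRESHOLD = 0.7   # step 3 confidence threshold (assumed value)
DIFFICULTY_THRESHOLD = 0.5   # step 4.3 constraint threshold T (assumed value)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def route(task_logits, p_correct=None):
    """Pick a candidate model and decide whether to enable deep thinking."""
    probs = softmax(task_logits)
    conf = max(probs)
    model = TASK_MODELS[probs.index(conf)]
    if conf < CONFIDENCE_THRESHOLD:
        # Assumed fallback for ambiguous requests; the abstract does not
        # specify what happens below the confidence threshold.
        model = "casual_consultation"
    # Step 4: only answer/coding models support deep thinking; enable it
    # when the no-deep-thinking correctness probability falls below T.
    deep_thinking = (
        model in ("subject_answer", "code_generation")
        and p_correct is not None
        and p_correct < DIFFICULTY_THRESHOLD
    )
    return model, deep_thinking
```

A high-confidence subject-answer request with a low predicted P_correct would thus be routed to the subject-answer model with deep thinking enabled.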

Inventors

  • LI KAN
  • SHI JIAYI

Assignees

  • 北京理工大学 (Beijing Institute of Technology)

Dates

Publication Date
2026-05-12
Application Date
2025-12-01

Claims (8)

  1. A multi-model routing method oriented to language model reasoning, applied to a system comprising a server and a plurality of clients, wherein the server is used for deploying the multi-model routing method and the clients are used for deploying large language models, the method being characterized by comprising the following steps: Step 1, deploying on the server and the clients respectively; Step 2, the server receiving a task request sent by a user and classifying the task request with a trained task request classifier; Step 2.1, constructing a request task classifier that takes Qwen-8B-base as a semantic encoder and adopts a multi-layer perceptron to classify the fixed-dimension semantic vectors mapped by the semantic encoder; Step 2.2, classifying task requests with the trained task request classifier; Step 3, setting a confidence threshold and comparing the task classification probability with the confidence threshold to route the user request to the corresponding candidate model; Step 4, when the candidate model is a subject-answer model or a code-generation model, encoding the request into a semantic vector with the model, inputting the semantic vector to a trained difficulty estimator, and deciding whether to enable deep thinking according to the answer-correctness probability obtained from the trained difficulty estimator; Step 4.1, constructing a difficulty estimator that takes the subject-answer model or the code-generation model as a semantic encoder and adopts a multi-layer perceptron to estimate the difficulty of the fixed-dimension semantic vector mapped by the semantic encoder; Step 4.2, inputting the semantic vector into the trained difficulty estimator and obtaining the probability P_correct that the task is answered correctly without deep thinking; Step 4.3, setting a constraint threshold T and enabling deep thinking when P_correct is smaller than T; and Step 5, calling the corresponding model to respond to the user request according to the task classification and the deep-thinking decision, and returning the model response to the user.
  2. The multi-model routing method oriented to language model reasoning of claim 1, wherein step 1 is implemented by the following steps: Step 1.1, deploying a request task classifier on the server; and Step 1.2, deploying the candidate models and difficulty estimators on the clients.
  3. The multi-model routing method oriented to language model reasoning of claim 2, wherein step 1.2 is implemented by: Step 1.2.1, taking a casual consultation model, a creative writing model, a subject-answer model and a code-generation model as candidate models, wherein the subject-answer model and the code-generation model have deep-thinking capability; and Step 1.2.2, configuring a difficulty estimator for the subject-answer model and the code-generation model.
  4. The multi-model routing method oriented to language model reasoning of claim 1, wherein step 2.1 is implemented by: Step 2.1.1, marking the task types of user request data, taking the user request data and the request task types as supervision samples, freezing the Qwen-8B-base semantic encoder, and classifying the fixed-dimension semantic vectors mapped by the semantic encoder with a multi-layer perceptron; and Step 2.1.2, training the request task classifier with cross-entropy loss and the AdamW optimizer to obtain a trained request task classifier.
  5. The multi-model routing method oriented to language model reasoning of claim 1, wherein step 4.1 is implemented by the following steps: Step 4.1.1, marking the response correctness of user request data with deep thinking disabled, taking the user request data and the response correctness as supervision samples, freezing the semantic encoder of the subject-answer model or the code-generation model, and estimating the difficulty of the fixed-dimension semantic vectors mapped by the semantic encoder with a multi-layer perceptron; and Step 4.1.2, training the difficulty estimator with cross-entropy loss and the AdamW optimizer to obtain a trained difficulty estimator.
  6. A multi-model routing system for realizing the language-model-reasoning method of claim 1, comprising a request access and preprocessing module, a task classification module, and a user request routing and response module; the request access and preprocessing module is used for compliance authentication of user requests and passes the compliance-authenticated user request to the task classification module as input; the task classification module comprises a semantic encoder and a task classifier, is used for classifying user-request tasks, and passes the task classification result to the user request routing and response module as input; the user request routing and response module comprises the candidate models and difficulty estimators, is used for judging the deep-thinking enablement state and responding to the user request, and returns the response to the user as the output of the system.
  7. The language-model-reasoning-oriented multi-model routing system of claim 6, wherein the semantic encoder is used for vector-encoding the user request and passes the encoded request to the task classifier as input; and the task classifier is used for classifying the encoded user-request vectors and outputs the task classification result as the output of the task classification module.
  8. A multi-model routing device oriented to language model reasoning, characterized by comprising a client and a server; the client consists of a high-performance parallel computing unit for judging the deep-thinking enablement state and having the model respond to the user request, and a communication interface for returning the response, thereby realizing the deep-thinking decision and the model's response to the user request; and the server realizes task classification of the user request through a communication interface for receiving the user request and a high-performance parallel computing unit for task classification.
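The classifier head recited in claims 1 and 4 (a frozen semantic encoder mapping each request to a fixed-dimension vector, followed by a multi-layer perceptron over that vector) can be sketched as below. The dimensions, the hidden size, and the random projection standing in for the frozen Qwen-8B-base encoder are all illustrative assumptions; only the structure (frozen encoder, trainable MLP head, softmax over four task types) follows the claims.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, HIDDEN, NUM_TASKS = 16, 32, 4  # 4 task types per claim 3; sizes assumed

# Frozen "encoder": a fixed random projection used purely for illustration,
# standing in for the Qwen-8B-base semantic encoder of claim 4.
W_enc = rng.normal(size=(DIM, DIM))

# Trainable MLP head (one hidden layer; depth and width are assumptions).
W1, b1 = rng.normal(size=(DIM, HIDDEN)) * 0.1, np.zeros(HIDDEN)
W2, b2 = rng.normal(size=(HIDDEN, NUM_TASKS)) * 0.1, np.zeros(NUM_TASKS)

def classify(x):
    """Return a task-probability vector for one encoded request."""
    h = np.tanh((x @ W_enc) @ W1 + b1)   # hidden layer over the frozen embedding
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())    # numerically stable softmax
    return e / e.sum()

probs = classify(rng.normal(size=DIM))
```

The resulting probability vector is what step 3 compares against the confidence threshold when choosing a candidate model.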

Description

Multi-model routing method, device and system oriented to language model reasoning

Technical Field

The invention relates to a multi-model routing method, device and system for language model reasoning, belongs to the technical field of artificial-intelligence natural language processing, and is applied to scenarios in which user requests are routed among models.

Background

With the wide application of large language models (LLMs) in tasks such as dialogue, creative writing, subject reasoning and code generation, the problems caused by handling every request with a single model, namely high computational cost, large response latency and difficulty in guaranteeing stability, have become increasingly prominent. The industry has gradually turned to a multi-model orchestration/routing paradigm: models with different capabilities and scales are deployed on the same inference platform, and requests are dispatched by policy so that each task is completed, as far as possible, by the lowest-overhead model capable of solving it. Existing schemes depend on heuristic rules or manually preset thresholds, which makes it difficult to jointly optimize for task diversity, dynamic load, accuracy and cost. Therefore, how to select the lowest-cost language model according to the difficulty of a user request and the capability of each model has become an urgent problem to be solved.

Disclosure of Invention

The invention aims to solve the technical problem of selecting the lowest-cost language model according to the difficulty of a user request and the capability of each model, and provides a multi-model routing method, device and system oriented to language model reasoning. The method performs type recognition and difficulty estimation on a user request, combines overhead estimation with success-rate prediction, automatically selects the minimum-overhead model capable of solving the request, and, for answer and coding tasks, decides whether to start a deep-thinking mechanism based on the answer-correctness probability, thereby significantly reducing inference cost and latency while guaranteeing quality.

The aim of the invention is realized by the following technical scheme.

In one aspect, the multi-model routing method oriented to language model reasoning is applied to a server and a plurality of clients, wherein the server is used for deploying the multi-model routing method and the clients are used for deploying large language models, and the method comprises the following steps:

Step 1, deploying on the server and the clients respectively.
Step 1.1, deploying a request task classifier on the server.
Step 1.2, deploying the candidate models and difficulty estimators on the clients.
Step 1.2.1, taking a casual consultation model, a creative writing model, a subject-answer model and a code-generation model as candidate models, wherein the subject-answer model and the code-generation model have deep-thinking capability.
Step 1.2.2, configuring a difficulty estimator for the subject-answer model and the code-generation model.
Step 2, the server receives the task request sent by the user and classifies it with a trained task request classifier.
Step 2.1, constructing a request task classifier that takes Qwen-8B-base as a semantic encoder and adopts a multi-layer perceptron to classify the fixed-dimension semantic vectors mapped by the semantic encoder.
Step 2.1.1, marking the task types of user request data, taking the user request data and the request task types as supervision samples, freezing the Qwen-8B-base semantic encoder, and classifying the fixed-dimension semantic vectors mapped by the semantic encoder with a multi-layer perceptron.
Step 2.1.2, training the request task classifier with cross-entropy loss and the AdamW optimizer to obtain a trained request task classifier.
Step 2.2, classifying task requests with the trained task request classifier.
Step 3, setting a confidence threshold and comparing the task classification probability with the confidence threshold to route the user request to the corresponding candidate model.
Step 4, when the candidate model is a subject-answer model or a code-generation model, encoding the request into a semantic vector with the model, inputting the semantic vector to a trained difficulty estimator, and deciding whether to enable deep thinking according to the answer-correctness probability obtained from the trained difficulty estimator.
Step 4.1, constructing a difficulty estimator that takes the subject-answer model or the code-generation model as a semantic encoder and adopts a multi-layer perceptron to estimate the difficulty of the fixed-dimension semantic vector mapped by the semantic encoder.
Step 4.1.1, marking the response correctness of user request data with deep thinking disabled, taking the user request data and the response correctness as supervision samples, freezing the semantic encoder of the subject-answer model or the code-generation model, and estimating the difficulty of the fixed-dimension semantic vectors mapped by the semantic encoder with a multi-layer perceptron.
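The training recipe named in steps 2.1.2 and 4.1.2 (supervised labels, a frozen encoder, cross-entropy loss, and the AdamW optimizer) can be sketched as follows for the difficulty estimator, whose labels are response correctness without deep thinking. As a simplification, a single logistic layer stands in for the MLP head, the AdamW update is written out by hand to show its decoupled weight decay, and the synthetic data, sizes, and learning rates are all assumptions rather than the patent's values.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 8
X = rng.normal(size=(64, DIM))          # stand-ins for frozen-encoder vectors
y = (X[:, 0] > 0).astype(float)         # synthetic correctness labels (step 4.1.1)

w, b = np.zeros(DIM), 0.0               # logistic head standing in for the MLP
m_w = np.zeros(DIM)                     # first-moment estimate (AdamW)
v_w = np.zeros(DIM)                     # second-moment estimate (AdamW)
lr, beta1, beta2, eps, wd = 0.1, 0.9, 0.999, 1e-8, 0.01  # assumed hyperparameters

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for t in range(1, 201):
    p = sigmoid(X @ w + b)              # predicted P_correct per request
    grad_w = X.T @ (p - y) / len(y)     # gradient of binary cross-entropy
    grad_b = float(np.mean(p - y))
    # AdamW step: adaptive moments, then decoupled weight decay on w.
    m_w = beta1 * m_w + (1 - beta1) * grad_w
    v_w = beta2 * v_w + (1 - beta2) * grad_w**2
    m_hat = m_w / (1 - beta1**t)
    v_hat = v_w / (1 - beta2**t)
    w -= lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    b -= lr * grad_b

p_correct = sigmoid(X @ w + b)          # step 4.2: P_correct without deep thinking
accuracy = float(np.mean((p_correct > 0.5) == (y > 0.5)))
```

At inference time, step 4.3 would compare each predicted P_correct against the constraint threshold T and enable deep thinking when P_correct falls below it.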