CN-122019143-A - Large model recommendation method, device, medium and equipment

CN122019143A

Abstract

The application relates to a large model recommendation method, device, medium and equipment, applied to an API gateway. The method comprises: obtaining cache data for a plurality of large models; determining the semantic similarity between the current session and the cache data of the plurality of large models; when the semantic similarity is greater than or equal to a similarity threshold, selecting a recommendation model from the plurality of large models and determining a recommendation index of the recommendation model; generating model recommendation information when the recommendation index meets a preset condition; and sending the model recommendation information to the requester of the current session, so that the requester can decide whether to adopt the recommendation model in subsequent sessions. A cache hierarchy spanning multiple models can be built so that semantically similar requests can be identified. Based on the semantic similarity, large models in the same knowledge field with comparable performance but better cost-effectiveness are recommended to users, the allocation of model invocation resources is optimized, and compute utilization is improved.

Inventors

  • WAN WEISONG
  • LI JINFENG
  • TONG JIAN

Assignees

  • 杭州缘算科技有限公司

Dates

Publication Date
2026-05-12
Application Date
2025-12-31

Claims (13)

  1. A large model recommendation method, applied to an API gateway, comprising: obtaining cache data for a plurality of large models, wherein the cache data comprises request prompt words and response information; determining the semantic similarity between the request prompt words and response information of a current session and the request prompt words and response information of the cache data of the plurality of large models; when the semantic similarity is greater than or equal to a similarity threshold, selecting a recommendation model from the plurality of large models and determining a recommendation index of the recommendation model; generating model recommendation information when the recommendation index meets a preset condition; and sending the model recommendation information to a requester of the current session, so that the requester can determine whether to adopt the recommendation model in a subsequent session.
  2. The large model recommendation method of claim 1, wherein determining the semantic similarity between the request prompt words and response information of the current session and those of the cache data of the plurality of large models comprises: determining the prompt word similarity between the request prompt words of the current session and the request prompt words of the cache data; when the prompt word similarity is greater than a preset threshold, determining the response similarity between the response information of the current session and the response information of the cache data; and determining the semantic similarity based on the prompt word similarity and the response similarity.
  3. The large model recommendation method of claim 2, wherein determining the semantic similarity based on the prompt word similarity and the response similarity comprises: determining the weighted sum of the prompt word similarity and the response similarity as the semantic similarity.
  4. The large model recommendation method of claim 2, wherein the cache data further comprises model names and model performance information, and wherein selecting a recommendation model from the plurality of large models and determining a recommendation index of the recommendation model comprises: determining a model whose semantic similarity is greater than the similarity threshold as the recommendation model; acquiring the performance information of the original model corresponding to the current session and the performance information of the recommendation model, wherein the performance information comprises cost, delay and response quality; and determining the recommendation index based on the cost saving, the delay reduction and the response quality of the recommendation model relative to the original model.
  5. The large model recommendation method of claim 4, wherein determining the recommendation index comprises: obtaining recommendation weights, wherein the recommendation weights comprise a cost weight, a delay weight and a response quality weight; and performing a weighted fusion of the cost saving, the delay reduction and the response quality to obtain the recommendation index.
  6. The large model recommendation method of claim 1, wherein the preset condition comprises: the recommendation index is greater than a preset value; or the recommendation index is greater than a preset value and the model cost saving is greater than a preset percentage.
  7. The large model recommendation method of claim 5, further comprising: detecting subsequent requests of requesters and determining whether each requester adopts the recommendation model; and counting the adoption rate per unit time, and adjusting the similarity threshold or the recommendation weights based on the adoption rate.
  8. The large model recommendation method of claim 7, wherein adjusting the similarity threshold based on the adoption rate comprises: increasing the similarity threshold when the adoption rate is lower than a first target value; and decreasing the similarity threshold when the adoption rate is higher than a second target value.
  9. The large model recommendation method of claim 7, wherein adjusting the recommendation weights based on the adoption rate comprises: when the adoption rate lies between the first target value and the second target value, adjusting the recommendation weights according to the cost saving ratio, the response quality and the delay after the recommendation model is adopted.
  10. A large model recommendation device, applied to an API gateway, comprising: a cache module configured to acquire cache data for a plurality of large models, wherein the cache data comprises request prompt words and response information; a comparison module configured to determine the semantic similarity between the request prompt words and response information of the current session and the request prompt words and response information of the cache data of the plurality of large models; a selection module configured to select a recommendation model from the plurality of large models and determine a recommendation index of the recommendation model when the semantic similarity is greater than or equal to a similarity threshold; a recommendation information generation module configured to generate model recommendation information when the recommendation index meets a preset condition; and a response module configured to send the model recommendation information to a requester of the current session, so that the requester can determine whether to adopt the recommendation model in a subsequent session.
  11. The large model recommendation device of claim 10, further comprising: a detection module configured to detect subsequent requests of requesters and determine whether each requester adopts the recommendation model; and an adjustment module configured to count the adoption rate per unit time and adjust the similarity threshold or the recommendation weights based on the adoption rate.
  12. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed, implements the steps of the method according to any one of claims 1-9.
  13. A computer device comprising a processor, a memory and a computer program stored on the memory, wherein the processor implements the steps of the method according to any one of claims 1-9 when the computer program is executed.
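The adoption-rate feedback of claims 7 and 8 can be sketched as follows. This is an illustrative reading of the claims, not text from the patent: the target values, step size and clamping bounds are assumptions chosen for the example.

```python
def adjust_similarity_threshold(threshold, adoption_rate,
                                first_target=0.3, second_target=0.7,
                                step=0.02, lo=0.5, hi=0.99):
    """Per claim 8: raise the similarity threshold when the adoption
    rate is below the first target value (recommendations are landing
    poorly, so be stricter), and lower it when the adoption rate is
    above the second target value (there is room to recommend more).
    Between the two targets the threshold is left unchanged (claim 9
    would instead adjust the recommendation weights in that band)."""
    if adoption_rate < first_target:
        threshold += step
    elif adoption_rate > second_target:
        threshold -= step
    # Keep the threshold inside an assumed sane range.
    return min(hi, max(lo, threshold))
```

A gateway would call this once per unit time after counting adoptions, per claim 7.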

Description

Large model recommendation method, device, medium and equipment

Technical Field

The application relates to the technical field of cloud computing and Large Language Model (LLM) scheduling, and in particular to a large model recommendation method, device, medium and equipment.

Background

With the popularity of large models, enterprises face significant cost pressure when using LLMs. Taking a certain high-parameter closed-source model as an example, its unit price is relatively high, while some lightweight or open-source models have an obvious cost advantage, with price differences of several times. In high-frequency calling scenarios (customer service, content generation), monthly costs can reach tens of thousands to hundreds of thousands of dollars, and the cost-effectiveness of different models differs markedly. In the related art, a client cache only caches requests for the same model, so semantic similarity cannot be compared across models; model-level optimization (such as quantization and distillation) requires modifying the model structure, so flexible online switching cannot be achieved; and a static routing strategy selects a model based on rules or load, and cannot select a model dynamically by combining real semantics with model quality. In addition, the prior art lacks a feedback loop, making continuous optimization difficult. Therefore, how to construct a response cache system spanning multiple models, so that semantically similar requests issued to different large models can be identified, and how to design a large model recommendation strategy that trades off cost, delay and semantic consistency accordingly, is the core technical problem to be solved.
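The cross-model cache described above can be sketched as follows. This is a hedged illustration, not the patent's implementation: the cache schema (`model`, `prompt_vec`, `response` keys) and the use of cosine similarity over precomputed prompt embeddings are assumptions for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def lookup_cross_model(prompt_vec, cache, threshold=0.9):
    """Scan cached entries from *all* models (not only the model being
    called, unlike a per-model client cache) and return entries whose
    cached prompt embedding is semantically similar to the current
    request, best match first."""
    hits = []
    for entry in cache:
        sim = cosine(prompt_vec, entry["prompt_vec"])
        if sim >= threshold:
            hits.append((sim, entry))
    return sorted(hits, key=lambda t: t[0], reverse=True)
```

Because the lookup is keyed on semantics rather than on model identity, a hit against a cheaper model's cache is exactly the signal the recommendation step needs.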
On this basis, how to continuously track users' adoption behavior and, based on the feedback data, achieve closed-loop optimization of the recommendation strategy and dynamic adaptation to scenarios is a further technical problem in the field.

Disclosure of Invention

To overcome the problems in the related art, the application provides a large model recommendation method, device, medium and equipment. According to a first aspect of an embodiment of the present application, there is provided a large model recommendation method applied to an API gateway, comprising: obtaining cache data for a plurality of large models, wherein the cache data comprises request prompt words and response information; determining the semantic similarity between the request prompt words and response information of a current session and the request prompt words and response information of the cache data of the plurality of large models; when the semantic similarity is greater than or equal to a similarity threshold, selecting a recommendation model from the plurality of large models and determining a recommendation index of the recommendation model; generating model recommendation information when the recommendation index meets a preset condition; and sending the model recommendation information to a requester of the current session, so that the requester can determine whether to adopt the recommendation model in a subsequent session.
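The decision chain of the first aspect can be sketched end to end. The weights, thresholds and the simple first branch of the preset condition are illustrative assumptions; the weighted-sum and weighted-fusion forms follow claims 3 and 5.

```python
def semantic_similarity(prompt_sim, response_sim, w_prompt=0.6, w_resp=0.4):
    # Claim 3: semantic similarity as a weighted sum of the prompt word
    # similarity and the response similarity (weights are illustrative).
    return w_prompt * prompt_sim + w_resp * response_sim

def recommendation_index(cost_saving, delay_reduction, quality,
                         w_cost=0.5, w_delay=0.3, w_quality=0.2):
    # Claim 5: weighted fusion of the cost saving, the delay reduction
    # and the response quality of the recommended model relative to the
    # original model (weights are illustrative).
    return w_cost * cost_saving + w_delay * delay_reduction + w_quality * quality

def maybe_recommend(sim, index, sim_threshold=0.85, index_threshold=0.5):
    # Claim 1 gate plus the first branch of the preset condition of
    # claim 6: recommend only if similarity clears its threshold and
    # the recommendation index exceeds the preset value.
    if sim >= sim_threshold and index > index_threshold:
        return {"recommend": True, "index": index}
    return {"recommend": False}
```

The gateway would attach the resulting recommendation information to the response, leaving the adoption decision to the requester, as the claims specify.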
Based on the foregoing, in some embodiments of the present application, determining the semantic similarity between the request prompt words and response information of the current session and those of the cache data of the plurality of large models comprises: determining the prompt word similarity between the request prompt words of the current session and the request prompt words of the cache data; when the prompt word similarity is greater than a preset threshold, determining the response similarity between the response information of the current session and the response information of the cache data; and determining the semantic similarity based on the prompt word similarity and the response similarity. Based on the foregoing, in some embodiments of the present application, determining the semantic similarity based on the prompt word similarity and the response similarity comprises: determining the weighted sum of the prompt word similarity and the response similarity as the semantic similarity. Based on the foregoing, in some embodiments of the present application, the cache data further includes a model name and model performance information, and selecting a recommendation model from the plurality of large models and determining a recommendation index of the recommendation model comprises: determining a model whose semantic similarity is greater than the similarity threshold as the recommendation model; acquiring the performance information of the original model corresponding to the current session and the performance information of the recommendation model, wherein the performance information comprises cost, delay