CN-121996415-A - AI service processing method, device, medium and equipment
Abstract
The application relates to a processing method, apparatus, medium, and device for an AI service. The processing method is applied to an edge processing node and comprises: receiving an access request from a requesting end for an AI service of a large model; extracting the prompt text from the access request; converting the prompt text into a feature vector of a specified dimension; generating a cache key based on the feature vector; querying a local cache; if the cache key and its cached content are found, returning the cached content to the requesting end; if the cache key is not found, determining a target model from a plurality of large models based on the prompt text, obtaining response content from the target model, returning the response content to the requesting end, and storing the response content in the local cache as the value of the cache key. In this way, caches can be generated according to the different semantics of prompt texts, which solves the problem that traditional string matching cannot recognize semantic similarity, improves the cache hit rate, optimizes the utilization of large-model resources, raises the success rate of the AI service, and reduces cost.
Inventors
- WAN WEISONG
- LI JINFENG
- TONG JIAN
Assignees
- 杭州缘算科技有限公司
Dates
- Publication Date
- 20260508
- Application Date
- 20251231
Claims (18)
- 1. A method for processing AI services, applied to an edge processing node, comprising: receiving an access request from a requesting end for an AI service of a large model, and extracting the prompt text from the access request; converting the prompt text into a feature vector of a specified dimension, and generating a cache key based on the feature vector; querying a local cache, and, if the cache key and its cached content are found, returning the cached content to the requesting end, wherein the response delay of returning the cached content to the requesting end is smaller than the response delay of calling a large model to obtain the response content; if the cache key is not found, determining a target model from a plurality of large models based on the prompt text; and obtaining response content from the target model, returning the response content to the requesting end, and storing the response content in the local cache as the value of the cache key.
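As a rough illustration of the flow in claim 1 (a minimal sketch, not the patented implementation), a semantic cache at an edge node might look like the code below. The embedding function, model registry, cache store, and the values of N and M are all hypothetical placeholders.

```python
import hashlib

class SemanticCache:
    """Sketch of claim 1: embed prompt -> cache key -> local lookup -> model fallback."""

    def __init__(self, embed, select_model, models):
        self.embed = embed                # prompt text -> fixed-dimension feature vector
        self.select_model = select_model  # prompt text -> name of one of several large models
        self.models = models              # model name -> callable(prompt) -> response content
        self.store = {}                   # local cache: cache key -> response content

    def cache_key(self, prompt: str) -> str:
        vec = self.embed(prompt)                       # feature vector of specified dimension
        top = sorted(vec, key=abs, reverse=True)[:8]   # top-N components by absolute value
        digest = hashlib.sha256(repr(top).encode()).hexdigest()
        return f"ai:{digest[:16]}"                     # first M hex chars in a fixed key format

    def handle(self, prompt: str) -> str:
        key = self.cache_key(prompt)
        if key in self.store:             # cache hit: respond with lower latency than a model call
            return self.store[key]
        model = self.select_model(prompt) # cache miss: pick a target model from the plurality
        response = self.models[model](prompt)
        self.store[key] = response        # store response content as the value of the cache key
        return response
```

A cache hit skips the large-model call entirely, which is where the claimed latency and cost savings come from.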
- 2. The AI service processing method of claim 1, wherein generating a cache key based on the feature vector comprises: sorting the components of the feature vector of the specified dimension in descending order of absolute value, computing a hash value over the first N components, and generating the cache key based on the hash value.
- 3. The AI service processing method of claim 2, wherein generating a cache key based on the hash value comprises: extracting the first M bits of the hash value, embedding them in a designated format, and generating the cache key.
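Claims 2 and 3 can be illustrated with the following sketch. The values of N and M, the rounding precision, and the key format are illustrative choices, not values fixed by the claims.

```python
import hashlib

def make_cache_key(vector, n=16, m=12, fmt="llmcache:{}"):
    """Sort components by descending absolute value, hash the first N,
    and embed the first M hex characters of the digest in a key format."""
    top_n = sorted(vector, key=abs, reverse=True)[:n]
    # Rounding before hashing lets nearby feature vectors collide on the
    # same key, which is what allows semantically similar prompts to share
    # a cache entry (an assumption about the intended behavior).
    digest = hashlib.sha256(",".join(f"{x:.4f}" for x in top_n).encode()).hexdigest()
    return fmt.format(digest[:m])
```

The dominant components carry most of the semantic signal, so hashing only the top-N keeps keys stable against small noise in the remaining dimensions.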
- 4. The AI service processing method of claim 1, wherein the determining a target model from a plurality of large models based on the prompt text comprises: if the access request includes information designating a model, determining the designated model as the target model; and if the access request does not include information designating a model, determining the target model according to model selection logic.
- 5. The AI service processing method of claim 4, wherein determining a target model according to model selection logic comprises: acquiring the real-time price, remaining request quota, and delay time of each large model; determining a weighted score over the real-time price, remaining request quota, and delay time, so that the AI service dynamically adjusts its calling tendency toward the large models according to the weighted scores; and selecting the target model from among the large models with low weighted scores.
- 6. The AI service processing method of claim 5, wherein determining the weighted score over the real-time price, remaining request quota, and delay time comprises: normalizing the real-time price relative to the price of a reference model to obtain a price-normalized value; inversely calculating the remaining request quota to obtain an availability inverse value; normalizing the delay time relative to a preset delay threshold to obtain a delay-normalized value; and carrying out a weighted summation of the price-normalized value, the availability inverse value, and the delay-normalized value according to a first weight, a second weight, and a third weight to obtain the weighted score.
- 7. The AI service processing method of claim 5, wherein selecting the target model from among the large models with low weighted scores comprises: selecting the large model with the lowest weighted score as the target model; or selecting a large model whose weighted score is lower than a preset value and whose remaining request quota is larger than a preset value as the target model.
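The scoring and selection in claims 5-7 can be sketched as follows; the weight values, the `+ 1` smoothing in the availability term, and the metadata tuple shape are illustrative assumptions, not part of the claims.

```python
def weighted_score(price, ref_price, remaining_quota, latency_ms, latency_threshold_ms,
                   w_price=0.5, w_avail=0.3, w_latency=0.2):
    """Lower is better: weighted sum of normalized price, inverse availability,
    and normalized latency (first, second, and third weights)."""
    price_norm = price / ref_price                    # price relative to a reference model
    avail_inv = 1.0 / (remaining_quota + 1)           # fewer remaining requests -> worse score
    latency_norm = latency_ms / latency_threshold_ms  # latency relative to preset threshold
    return w_price * price_norm + w_avail * avail_inv + w_latency * latency_norm

def pick_target(models):
    """models: {name: (price, ref_price, quota, latency_ms, threshold_ms)}.
    Implements the first branch of claim 7: lowest weighted score wins."""
    return min(models, key=lambda name: weighted_score(*models[name]))
```

Because the score is recomputed from real-time metrics on each request, the calling tendency shifts automatically as prices, quotas, and latencies change.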
- 8. The AI service processing method of claim 1, further comprising cleaning the response content before returning the response content to the requesting end; the cleaning of the response content comprises: converting the response content into an object using a JSON parser; traversing the object and deleting fields including user, system_fingerprint, and logprobs; and re-serializing the object into a JSON string.
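The cleaning step of claim 8 can be sketched as below; the recursive traversal and the exact field set are assumptions based on the claim text.

```python
import json

SENSITIVE_FIELDS = {"user", "system_fingerprint", "logprobs"}  # fields named in claim 8

def clean_response(raw: str) -> str:
    """Parse the response with a JSON parser, recursively drop the listed
    fields at any nesting depth, and re-serialize to a JSON string."""
    def strip(node):
        if isinstance(node, dict):
            return {k: strip(v) for k, v in node.items() if k not in SENSITIVE_FIELDS}
        if isinstance(node, list):
            return [strip(v) for v in node]
        return node
    return json.dumps(strip(json.loads(raw)))
```

Dropping per-request fields like these also makes cached responses reusable across clients, since the stored value no longer carries requester-specific data.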
- 9. The AI service processing method of claim 1, further comprising: setting the priority of the AI service process to the lowest priority; and configuring the maximum CPU occupancy, maximum memory usage, and maximum read-write bandwidth of the AI service.
- 10. The AI service processing method of claim 1, further comprising: collecting, in real time, the cache hit rate, total CPU occupancy, and memory usage of the AI service; and not storing the response content when the cache hit rate is smaller than a first preset threshold, or the total CPU occupancy is larger than a second preset threshold, or the memory usage of the AI service is larger than a third preset threshold.
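The write-suppression rule of claim 10 amounts to a simple guard; the threshold values used here are illustrative, not claimed.

```python
def should_cache(hit_rate, cpu_occupancy, mem_usage_mb,
                 min_hit_rate=0.2, max_cpu=0.8, max_mem_mb=512):
    """Skip cache writes when the cache is not paying off or node resources
    are tight, so the AI service does not squeeze the core CDN workload."""
    if hit_rate < min_hit_rate:      # first preset threshold: cache not effective
        return False
    if cpu_occupancy > max_cpu:      # second preset threshold: CPU pressure
        return False
    if mem_usage_mb > max_mem_mb:    # third preset threshold: memory pressure
        return False
    return True
```

Reads still serve existing entries under pressure; only new writes are suppressed, which caps the cache's resource footprint without discarding its value.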
- 11. The AI service processing method of claim 1, further comprising: for each access request of the AI service, recording the access cost, client ID, target model, and whether the cache was hit; generating a client bill; and displaying the accumulated savings, model usage distribution, cache hit rate, and return on investment to the client through the management interface of the AI service.
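The per-request accounting of claim 11 could be recorded with a structure like this; the field names and the savings formula (model price avoided on each cache hit) are assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class AccessRecord:
    client_id: str
    target_model: str
    cost: float           # cost actually paid for this request (0 on a cache hit)
    model_price: float    # what the large-model call would have cost
    cache_hit: bool

@dataclass
class Bill:
    records: list = field(default_factory=list)

    def add(self, rec: AccessRecord):
        self.records.append(rec)

    def summary(self):
        """Aggregate figures for the management interface."""
        hits = sum(r.cache_hit for r in self.records)
        saved = sum(r.model_price for r in self.records if r.cache_hit)
        return {
            "requests": len(self.records),
            "cache_hit_rate": hits / len(self.records) if self.records else 0.0,
            "accumulated_savings": saved,
        }
```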
- 12. The AI service processing method of claim 1, wherein the edge processing node multiplexes the file-push and configuration-delivery channels of the management system to receive image packages of the AI service.
- 13. A processing apparatus for AI services, applied to an edge processing node, comprising: a request receiving module, configured to receive an access request from a requesting end for an AI service of a large model and extract the prompt text from the access request; a cache key generation module, configured to convert the prompt text into a feature vector of a specified dimension and generate a cache key based on the feature vector; a query module, configured to query a local cache and, if the cache key and its cached content are found, return the cached content to the requesting end, wherein the response delay of returning the cached content to the requesting end is smaller than the response delay of calling a large model to obtain the response content; and a caching module, configured to, if the cache key is not found, determine a target model from a plurality of large models based on the prompt text, obtain response content from the target model, return the response content to the requesting end, and store the response content in the local cache as the value of the cache key.
- 14. The AI service processing apparatus of claim 13, further comprising: a configuration module, configured to set the priority of the AI service process to the lowest priority and to configure the maximum CPU occupancy, maximum memory usage, and maximum read-write bandwidth of the AI service.
- 15. The AI service processing apparatus of claim 13, wherein the response content is not stored when the cache hit rate is smaller than a first preset threshold, or the total CPU occupancy is larger than a second preset threshold, or the memory usage of the AI service is larger than a third preset threshold.
- 16. The AI service processing apparatus of claim 13, further comprising: a bill generation module, configured to record, for each access request of the AI service, the access cost, client ID, target model, and whether the cache was hit; to generate a client bill; and to display the accumulated savings, model usage distribution, cache hit rate, and return on investment to the client through the management interface of the AI service.
- 17. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed, implements the steps of the method according to any one of claims 1-12.
- 18. A computer device comprising a processor, a memory, and a computer program stored on the memory, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1-12.
Description
AI service processing method, device, medium and equipment
Technical Field
The present application relates to the field of edge computing, and in particular to a method, apparatus, medium, and device for processing an AI service.
Background
Existing large-language-model interface aggregation platforms generally adopt a centralized architecture: every call to a large model requires payment to the model vendor, and no caching mechanism is provided. Measured data show that about 35% of user questions in customer-service scenarios are repetitive consultations; repeated or similar requests invoke the large model again and again, wasting large amounts of money. Cross-region calls to large models also incur high latency and cannot satisfy real-time interaction scenarios. Meanwhile, the average CPU load of CDN edge nodes is only 20%, leaving 80% of their computing power unused. Therefore, using the spare computing power of existing CDN edge nodes to provide AI services, and caching the AI requests that access large models, can greatly reduce the number of large-model calls, save cost, accelerate the AI service, and raise response speed. In building an AI service cache with edge computing power, the following technical problems remain to be solved. On the one hand, traditional caching is based on exact string matching and cannot handle the semantic similarity of AI requests, so the cache hit rate is low. On the other hand, when an edge node provides CDN service and AI service at the same time, memory pressure or I/O contention easily arises between the two services, and the quality of the core CDN service is difficult to guarantee.
Disclosure of Invention
In order to overcome the problems in the related art, the application provides a processing method, apparatus, medium, and device for an AI service.
According to a first aspect of an embodiment of the present application, there is provided a processing method of an AI service, applied to an edge processing node, including: receiving an access request from a requesting end for an AI service of a large model, and extracting the prompt text from the access request; converting the prompt text into a feature vector of a specified dimension, and generating a cache key based on the feature vector; querying a local cache, and, if the cache key and its cached content are found, returning the cached content to the requesting end, wherein the response delay of returning the cached content to the requesting end is smaller than the response delay of calling a large model to obtain the response content; if the cache key is not found, determining a target model from a plurality of large models based on the prompt text; and obtaining response content from the target model, returning the response content to the requesting end, and storing the response content in the local cache as the value of the cache key. Based on the foregoing, in some embodiments of the present application, generating a cache key based on the feature vector includes: sorting the components of the feature vector of the specified dimension in descending order of absolute value, computing a hash value over the first N components, and generating the cache key based on the hash value. Based on the foregoing, in some embodiments of the present application, generating a cache key based on the hash value includes: extracting the first M bits of the hash value, embedding them in a designated format, and generating the cache key. Based on the foregoing, in some embodiments of the present application, determining the target model from the plurality of large models based on the prompt text includes: if the access request includes information designating a model, determining the designated model as the target model; and if the access request does not include information designating a model, determining the target model according to model selection logic.
Based on the foregoing, in some embodiments of the application, determining the target model according to the model selection logic comprises: acquiring the real-time price, remaining request quota, and delay time of each large model; determining a weighted score over the real-time price, remaining request quota, and delay time, so that the AI service dynamically adjusts its calling tendency toward the large models according to the weighted scores; and selecting the target model from among the large models with low weighted scores. Based on the foregoing, in some embodiments of the present application, determining the weighted score over the real-time price, remaining request quota, and delay time includes: normalizing the real-time price relative to the price of a reference model to obtain a price-normalized value; inversely calculating the remaining request quota to obtain an availability inverse value; normalizing the delay time relative to a preset delay threshold to obtain a delay-normalized value; and carrying out a weighted summation of the price-normalized value, the availability inverse value, and the delay-normalized value according to a first weight, a second weight, and a third weight to obtain the weighted score.