US-12626070-B2 - Serverless functional routing for large language model inference service

US 12626070 B2

Abstract

A computer-implemented method for serving a large language model (LLM) application via a serverless function router communicative with multiple endpoints that each have a set of subject matter expert models stored thereon is provided. The computer-implemented method includes receiving a prompt, querying a database comprising multiple datasets for an indication as to which one of the multiple datasets has a highest level of similarity with the prompt, recognizing one of the multiple endpoints as having the set of the expert models stored thereon which have a closest match with the one of the multiple datasets and routing the prompt to the one of the multiple endpoints having the set of the expert models stored thereon which have the closest match with the one of the multiple datasets.

Inventors

  • Bo Wen
  • Chen Wang
  • Huamin Chen

Assignees

  • INTERNATIONAL BUSINESS MACHINES CORPORATION

Dates

Publication Date
2026-05-12
Application Date
2024-02-08

Claims (17)

  1. A computer-implemented method for serving a large language model (LLM) application, the computer-implemented method comprising: disposing a serverless function router in communication with multiple discrete servers serving as multiple endpoints; grouping subject matter expert models into sets of subject matter expert models and storing each set of subject matter expert models on one of the multiple endpoints; receiving a prompt at the serverless function router; querying, by the serverless function router, a database comprising multiple datasets for an indication as to which one of the multiple datasets has a highest level of similarity with the prompt; recognizing, by the serverless function router, one of the multiple endpoints as having the set of the expert models stored thereon which have a closest match with the one of the multiple datasets; and routing, by the serverless function router, the prompt to the one of the multiple endpoints having the set of the expert models stored thereon which have the closest match with the one of the multiple datasets.
  2. The computer-implemented method according to claim 1, wherein the database comprises a vector database.
  3. The computer-implemented method according to claim 1, wherein each of the multiple datasets has a closest match with the subject matter expert models stored on one of the multiple endpoints.
  4. The computer-implemented method according to claim 3, wherein: the set of subject matter expert models of a first one of the multiple endpoints is configured to handle prompts relating to medical subject matter, the set of subject matter expert models of a second one of the multiple endpoints is configured to handle prompts relating to financial subject matter, and the set of subject matter expert models of a third one of the multiple endpoints is configured to handle prompts relating to technical subject matter.
  5. The computer-implemented method according to claim 3, wherein each of the sets of subject matter expert models of each of the first, second and third ones of the multiple endpoints comprises one or more foundation models and one or more fine-tuned models.
  6. The computer-implemented method according to claim 1, wherein the serverless function router is a prompt aware router and the routing comprises prompt aware routing.
  7. A computer program product for serving a large language model (LLM) application, the computer program product comprising one or more computer readable storage media having computer readable program code collectively stored on the one or more computer readable storage media, the computer readable program code being executed by a processor of a computer system to cause the computer system to perform a method comprising: disposing a serverless function router in communication with multiple discrete servers serving as multiple endpoints; grouping subject matter expert models into sets of subject matter expert models and storing each set of subject matter expert models on one of the multiple endpoints; receiving a prompt at the serverless function router; querying, by the serverless function router, a database comprising multiple datasets for an indication as to which one of the multiple datasets has a highest level of similarity with the prompt; recognizing, by the serverless function router, one of the multiple endpoints as having the set of the expert models stored thereon which have a closest match with the one of the multiple datasets; and routing, by the serverless function router, the prompt to the one of the multiple endpoints having the set of the expert models stored thereon which have the closest match with the one of the multiple datasets.
  8. The computer program product according to claim 7, wherein the database comprises a vector database.
  9. The computer program product according to claim 7, wherein each of the multiple datasets has a closest match with the subject matter expert models stored on one of the multiple endpoints.
  10. The computer program product according to claim 9, wherein: the set of subject matter expert models of a first one of the multiple endpoints is configured to handle prompts relating to medical subject matter, the set of subject matter expert models of a second one of the multiple endpoints is configured to handle prompts relating to financial subject matter, and the set of subject matter expert models of a third one of the multiple endpoints is configured to handle prompts relating to technical subject matter.
  11. The computer program product according to claim 9, wherein each of the sets of subject matter expert models of each of the first, second and third ones of the multiple endpoints comprises one or more foundation models and one or more fine-tuned models.
  12. The computer program product according to claim 7, wherein the serverless function router is a prompt aware router and the routing comprises prompt aware routing.
  13. A computing system comprising: a processor; a memory coupled to the processor; and one or more computer readable storage media coupled to the processor, the one or more computer readable storage media collectively containing instructions that are executed by the processor via the memory to implement a method for serving a large language model (LLM) application comprising: disposing a serverless function router in communication with multiple discrete servers serving as multiple endpoints; grouping subject matter expert models into sets of subject matter expert models and storing each set of subject matter expert models on one of the multiple endpoints; receiving a prompt at the serverless function router; querying, by the serverless function router, a database comprising multiple datasets for an indication as to which one of the multiple datasets has a highest level of similarity with the prompt; recognizing, by the serverless function router, one of the multiple endpoints as having the set of the expert models stored thereon which have a closest match with the one of the multiple datasets; and routing, by the serverless function router, the prompt to the one of the multiple endpoints having the set of the expert models stored thereon which have the closest match with the one of the multiple datasets.
  14. The computing system according to claim 13, wherein: the database comprises a vector database, and each of the multiple datasets has a closest match with the subject matter expert models stored on one of the multiple endpoints.
  15. The computing system according to claim 14, wherein: the set of subject matter expert models of a first one of the multiple endpoints is configured to handle prompts relating to medical subject matter, the set of subject matter expert models of a second one of the multiple endpoints is configured to handle prompts relating to financial subject matter, and the set of subject matter expert models of a third one of the multiple endpoints is configured to handle prompts relating to technical subject matter.
  16. The computing system according to claim 14, wherein each of the sets of subject matter expert models of each of the first, second and third ones of the multiple endpoints comprises one or more foundation models and one or more fine-tuned models.
  17. The computing system according to claim 13, wherein the serverless function router is a prompt aware router and the routing comprises prompt aware routing.
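The routing flow recited in claims 1 through 6 can be sketched as follows. This is a minimal illustration only: the keyword vocabulary, the toy count-based embedding, the in-memory dataset centroids, and the endpoint URLs are all assumptions standing in for a real embedding model, vector database, and inference servers, none of which are specified in the claims.

```python
import math

# Illustrative vocabulary for a toy embedding; a real system would use an
# embedding model rather than keyword counts.
VOCAB = ["diagnosis", "patient", "loan", "interest", "compiler", "kernel"]

def embed(text):
    """Toy embedding: keyword counts over a fixed vocabulary."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Stand-in "vector database": one centroid vector per dataset (claim 2).
DATASETS = {
    "medical": embed("diagnosis patient patient"),
    "financial": embed("loan interest interest"),
    "technical": embed("compiler kernel kernel"),
}

# Each dataset's closest-match set of expert models is stored on one
# endpoint (claims 3 and 4); these URLs are hypothetical.
ENDPOINTS = {
    "medical": "https://endpoint-1.example/infer",
    "financial": "https://endpoint-2.example/infer",
    "technical": "https://endpoint-3.example/infer",
}

def route(prompt):
    """Prompt-aware routing (claims 1 and 6): find the dataset with the
    highest similarity to the prompt, then return its matching endpoint."""
    scores = {name: cosine(embed(prompt), vec) for name, vec in DATASETS.items()}
    best = max(scores, key=scores.get)
    return ENDPOINTS[best]

print(route("patient presents with an unclear diagnosis"))
# prints https://endpoint-1.example/infer (the medical endpoint)
```

In a deployed system, `embed` would call an embedding model, the `DATASETS` lookup would be a similarity query against the vector database, and `route` would run as a serverless function that forwards the prompt to the selected endpoint over HTTP.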

Description

BACKGROUND

The present invention generally relates to large language models in computing systems. More specifically, the present invention relates to cost-effective and quality-assured serverless functional routing for a large language model (LLM) inference service.

An LLM is a language model that is notable for its ability to achieve general-purpose language generation. LLMs acquire these abilities by learning statistical relationships from text documents during computationally intensive self-supervised and semi-supervised training processes. While LLMs are generally artificial neural networks that can be built with transformer-based architectures, recent implementations have been based on alternative architectures, such as recurrent neural network variants. As an example, LLMs can be used for text generation and other forms of generative artificial intelligence (AI). In these or other cases, LLMs take input text and repeatedly predict next tokens or words. Until recently, fine-tuning was the only way a given LLM could be adapted to accomplish specific tasks. It has been found, however, that modern large LLMs can be prompt-engineered to achieve positive results by acquiring knowledge about the syntax, semantics and ontology inherent in human language.

SUMMARY

According to an aspect of the disclosure, a computer-implemented method for serving a large language model (LLM) application via a serverless function router communicative with multiple endpoints that each have a set of subject matter expert models stored thereon is provided.
The computer-implemented method includes receiving a prompt, querying a database comprising multiple datasets for an indication as to which one of the multiple datasets has a highest level of similarity with the prompt, recognizing one of the multiple endpoints as having the set of the expert models stored thereon which have a closest match with the one of the multiple datasets and routing the prompt to the one of the multiple endpoints having the set of the expert models stored thereon which have the closest match with the one of the multiple datasets. In additional or alternative embodiments, the computer-implemented method provides for a fast response to a prompt that does not waste valuable computing resources.

According to an aspect of the disclosure, a computer program product for serving a large language model (LLM) application via a serverless function router communicative with multiple endpoints that each have a set of subject matter expert models stored thereon is provided. The computer program product includes one or more computer readable storage media having computer readable program code collectively stored on the one or more computer readable storage media. The computer readable program code is executed by a processor of a computer system to cause the computer system to perform a method. The method includes receiving a prompt, querying a database comprising multiple datasets for an indication as to which one of the multiple datasets has a highest level of similarity with the prompt, recognizing one of the multiple endpoints as having the set of the expert models stored thereon which have a closest match with the one of the multiple datasets and routing the prompt to the one of the multiple endpoints having the set of the expert models stored thereon which have the closest match with the one of the multiple datasets. In additional or alternative embodiments, the method provides for a fast response to a prompt that does not waste valuable computing resources.

According to an aspect of the disclosure, a computing system is provided and includes a processor, a memory coupled to the processor and one or more computer readable storage media coupled to the processor. The one or more computer readable storage media collectively contain instructions that are executed by the processor via the memory to implement a method for serving a large language model (LLM) application via a serverless function router communicative with multiple endpoints that each have a set of subject matter expert models stored thereon. In additional or alternative embodiments, the method for serving the LLM application via a serverless function router communicative with multiple endpoints that each have a set of subject matter expert models stored thereon provides for a fast response to a prompt that does not waste valuable computing resources.

Additional technical features and benefits are realized through the techniques of the present invention. Embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed subject matter. For a better understanding, refer to the detailed description and to the drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The specifics of the exclusive rights described herein are particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other features and adva