US-20260127368-A1 - TECHNIQUES FOR ENFORCING STRUCTURED OUTPUT OF A GENERATIVE RESPONSE ENGINE

Abstract

The present technology is directed to a method for constraining an output of a generative response engine to be a valid output for a structured format defined in a request received by the generative response engine, the method includes receiving the request to generate content that conforms to the structured format using the generative response engine, where the request includes a schema that defines the structured format, and generating the response to the request to generate the content, where the response conforms to the structured format.

Inventors

Christopher Colby
Yun Jia Guan
Tomer Kaftan
Michal Pokrass
Ted Sanders
Brian Zhang

Assignees

OpenAI OpCo, LLC

Dates

Publication Date: 2026-05-07
Application Date: 2025-01-31

Claims (20)

1. A method for constraining an output of a generative response engine to be a valid output for a structured format defined in a request received by the generative response engine, the method comprising: receiving, by the generative response engine, the request to generate content that conforms to the structured format using the generative response engine, wherein the request includes a schema providing a syntax that defines valid output tokens for the structured format and wherein the generative response engine comprises a model trained to sample tokens from a probability distribution of possible tokens; receiving, by the model, a dynamic dictionary including possible tokens from which the model can sample, wherein the possible tokens in the dynamic dictionary are constrained by the schema; and generating, by the generative response engine, the response to the request to generate the content, wherein the response comprises tokens selected from the dynamic dictionary defining the possible tokens such that the response conforms to the structured format defined by the schema.
2. The method of claim 1, further comprising: determining by the generative response engine that the response should conform to the structured format, based on the determination to conform the response to the structured format, invoking a parser to assist the generative response engine in generating the response that conforms to the structured format.
3. The method of claim 1, further comprising: dynamically limiting the dynamic dictionary of tokens available for selection by the generative response engine to tokens that are considered to be valid according to the schema, wherein the generating the response includes generating the response from the tokens in the dynamic dictionary.
4. The method of claim 3, further comprising: building a parser based on a context-free grammar generated from the schema, wherein the parser has an initial parse configuration, a function mapping a current parse configuration and next character of input to a next parse configuration, and a set of final parse configurations, and wherein any of the initial parse configuration, the current parse configuration, the next parse configuration or the set of final parse configurations comprises a stack of parser parse configurations and a lexer parse configuration.
5. The method of claim 4, further comprising: generating, using the parser, an index, by: receiving a token of a string of tokens, wherein the request and the response are made up of tokens, the string of tokens including the tokens in the request and any incomplete response; and determining the set of final parse configurations for the token using the function, wherein the set of final parse configurations for the token is based on a precondition representing a predicate on the set of final parse configurations before the token can be consumed by the parser, wherein the precondition comprises a particular lexer parse configuration and a particular stack of parser parse configurations that are satisfied by the set of final parse configurations for the token.
6. The method of claim 5, wherein the particular stack of parser parse configurations of the precondition comprises at least one wildcard.
7. The method of claim 5, wherein the index is represented as an index trie comprising a trie path corresponding to the precondition of the string of tokens and wherein the string of tokens are stored as a node in the index trie, the node in the index trie points to the tokens that are considered to be valid according to the schema and thereby defines the dynamic dictionary.
8. The method of claim 7, further comprising: generating an optimized index from the index by: precomputing a union of token sets from nodes of the index trie satisfying a given parse configuration such that the generative response engine can extract the set of final parse configurations from a single path in the index trie.
9. The method of claim 5, wherein the content is generated using the index to select valid tokens from the dynamic dictionary.
10. The method of claim 1, further comprising: receiving, by the generative response engine, a string comprising a special token; based on a type of the special token, triggering constrained sampling by the generative response engine, such that output of the generative response engine is constrained to conform to the schema.
11. A computing system, comprising: at least one memory; and at least one processor coupled to the at least one memory and configured to: receive the request to generate content that conforms to the structured format using a generative response engine, wherein the request includes a schema providing a syntax that defines valid output tokens for the structured format and wherein the generative response engine comprises a model trained to sample tokens from a probability distribution of possible tokens; receive, by the model, a dynamic dictionary including possible tokens from which the model can sample, wherein the possible tokens in the dynamic dictionary are constrained by the schema; and generate, by the generative response engine, the response to the request to generate the content, wherein the response comprises tokens selected from the dynamic dictionary defining the possible tokens such that the response conforms to the structured format defined by the schema.
12. The computing system of claim 11, wherein the instructions further configure the computing system to: dynamically limit the dynamic dictionary of tokens available for selection by the generative response engine to tokens that are considered to be valid according to the schema, wherein the generating the response includes generating the response from the tokens in the dynamic dictionary.
13. The computing system of claim 12, wherein the instructions further configure the computing system to: generate a context-free grammar from the schema, wherein the context-free grammar is applied to perform the dynamically limiting the dynamic dictionary to the tokens available for output.
14. The computing system of claim 13, wherein the instructions further configure the computing system to: build a parser based on the context-free grammar, wherein the parser has an initial parse configuration, a function mapping a current parse configuration and next character of input to a next parse configuration, and a set of final parse configurations, and wherein any of the initial parse configuration, the current parse configuration, the next parse configuration or the set of final parse configurations comprises a stack of parser parse configurations and a lexer parse configuration.
15. The computing system of claim 14, wherein the instructions further configure the computing system to: generate, using the parser, an index, by: receiving a token of a string of tokens, wherein the request and the response are made up of tokens, the string of tokens including the tokens in the request and any incomplete response; and determining the set of final parse configurations for the token using the function, wherein the set of final parse configurations for the token is based on a precondition representing a predicate on the set of final parse configurations before the token can be consumed by the parser, wherein the precondition comprises a particular lexer parse configuration and a particular stack of parser parse configurations that are satisfied by the set of final parse configurations for the token.
16. The computing system of claim 15, wherein the particular stack of parser parse configurations of the precondition comprises at least one wildcard.
17. The computing system of claim 15, wherein the index is represented as an index trie comprising a trie path corresponding to the precondition of the string of tokens and wherein the string of tokens are stored as a node in the index trie, the node in the index trie points to the tokens that are considered to be valid according to the schema and thereby defines the dynamic dictionary.
18. The computing system of claim 17, wherein the instructions further configure the computing system to: generate an optimized index from the index by: precomputing a union of token sets from nodes of the index trie satisfying a given parse configuration such that the generative response engine can extract the set of final parse configurations from a single path in the index trie.
19. The computing system of claim 15, wherein the content is generated using the index to select valid tokens from the dynamic dictionary.
20. A non-transitory computer-readable storage medium comprising instructions that when executed by at least one processor, cause the at least one processor to: receive the request to generate content that conforms to the structured format using a generative response engine, wherein the request includes a schema providing a syntax that defines valid output tokens for the structured format and wherein the generative response engine comprises a model trained to sample tokens from a probability distribution of possible tokens; receive, by the model, a dynamic dictionary including possible tokens from which the model can sample, wherein the possible tokens in the dynamic dictionary are constrained by the schema; and generate, by the generative response engine, the response to the request to generate the content, wherein the response comprises tokens selected from the dynamic dictionary defining the possible tokens such that the response conforms to the structured format defined by the schema.
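
Editorial illustration (not part of the claims): the dynamic dictionary recited in claims 1 and 3, and the stack-based parse configuration recited in claim 4, can be pictured with the minimal Python sketch below. It is a sketch under stated assumptions, not the implementation described in this publication. It assumes a hypothetical six-token vocabulary and a toy balanced-bracket grammar in place of a schema-derived context-free grammar. The names VOCAB, CLOSER, step, dynamic_dictionary, sample_constrained, generate, and is_valid are invented for illustration. The parse configuration is reduced to a stack of expected closing brackets, with no lexer parse configuration, and no index trie is built.

```python
import math
import random

# Hypothetical toy vocabulary (assumption); a real tokenizer is far larger.
VOCAB = ["{", "}", "[", "]", "x", "<eos>"]
CLOSER = {"{": "}", "[": "]"}


def step(stack, token):
    """Map (current parse configuration, next token) to the next configuration.

    The configuration is a tuple of expected closing brackets, top at the end.
    Returns None when the token is not valid in this configuration.
    """
    if token in CLOSER:                    # opening bracket: push its closer
        return stack + (CLOSER[token],)
    if token in CLOSER.values():           # closing bracket: must match the top
        return stack[:-1] if stack and stack[-1] == token else None
    if token == "x":                       # payload is only valid inside brackets
        return stack if stack else None
    if token == "<eos>":                   # stopping is only valid when balanced
        return stack if not stack else None
    return None


def dynamic_dictionary(stack):
    """Tokens that remain valid next tokens for the current configuration."""
    return [t for t in VOCAB if step(stack, t) is not None]


def sample_constrained(logits, allowed, rng):
    """Sample from the softmax over logits, restricted to the allowed tokens."""
    weights = [math.exp(logits[t]) for t in allowed]
    return rng.choices(allowed, weights=weights, k=1)[0]


def generate(model_logits, rng, max_tokens=20):
    """Generate tokens, sampling each one only from the dynamic dictionary."""
    stack, out = (), []
    for _ in range(max_tokens):
        allowed = dynamic_dictionary(stack)
        token = sample_constrained(model_logits(out), allowed, rng)
        if token == "<eos>":
            return out
        out.append(token)
        stack = step(stack, token)
    # Token budget exhausted: close any open brackets, innermost first.
    out.extend(reversed(stack))
    return out


def is_valid(tokens):
    """Check that a token sequence is accepted by the toy grammar."""
    stack = ()
    for t in tokens:
        stack = step(stack, t)
        if stack is None:
            return False
    return stack == ()
```

In this sketch, model_logits stands in for the model and is any callable that maps the tokens generated so far to a dictionary of per-token scores. Because each token is drawn only from dynamic_dictionary(stack), a closing bracket that does not match the innermost open bracket can never be sampled, whatever scores the stand-in model assigns.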

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 63/716,446, filed Nov. 5, 2024, entitled “TECHNIQUES FOR ENFORCING STRUCTURED OUTPUT OF A GENERATIVE RESPONSE ENGINE,” which is expressly incorporated herein by reference in its entirety.

BACKGROUND

Generative response engines such as large language models represent a significant milestone in the field of artificial intelligence, revolutionizing computer-based natural language understanding and generation. Generative response engines, powered by advanced deep learning techniques, have demonstrated astonishing capabilities in tasks such as text generation, translation, summarization, and even code generation. Generative response engines can sift through vast amounts of text data, extract context, and provide coherent responses to a wide array of queries.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

Details of one or more aspects of the subject matter described in this disclosure are set forth in the accompanying drawings and the description below. However, the accompanying drawings illustrate only some typical aspects of this disclosure and are therefore not to be considered limiting of its scope. Other features, aspects, and advantages will become apparent from the description, the drawings, and the claims.

FIG. 1 illustrates an example system supporting a generative response engine during inference operations in accordance with some aspects of the present technology.
FIGS. 2A and 2B illustrate an example system for using a generative response engine to create a structured output in accordance with some aspects of the present technology.
FIG. 2B illustrates an aspect of the subject matter in accordance with some aspects of the present technology.
FIG. 3 illustrates an example trie in accordance with some aspects of the present technology.
FIG. 4 illustrates an example raw index trie in accordance with some aspects of the present technology.
FIGS. 5A and 5B illustrate an example of post-processing a raw index trie in accordance with some aspects of the present technology.
FIG. 6 illustrates an example of a serialized index trie in accordance with some aspects of the present technology.
FIG. 7 illustrates a method for constraining an output of a generative response engine to be a valid output for a structured format defined in a request received by the generative response engine in accordance with some aspects of the present technology.
FIG. 8 is a block diagram illustrating an example machine-learning platform for implementing various aspects of this disclosure in accordance with some aspects of the present technology.
FIG. 9A, FIG. 9B, and FIG. 9C illustrate an example transformer architecture in accordance with some aspects of the present technology.
FIG. 10 shows an example of a system for implementing some aspects of the present technology.

DETAILED DESCRIPTION

Various aspects of the disclosure are discussed in detail below. While specific implementations are discussed, it should be understood that this is done for illustration purposes only. A person skilled in the relevant art will recognize that other components and configurations may be used without departing from the spirit and scope of the disclosure.

Generative response engines such as large language models represent a significant milestone in the field of artificial intelligence, revolutionizing computer-based natural language understanding and generation. Generative response engines, powered by advanced deep learning techniques, have demonstrated astonishing capabilities in tasks such as text generation, translation, summarization, and even code generation. However, despite their remarkable linguistic prowess, these generative response engines ultimately output text based on statistical probabilities, as opposed to explicit understanding.

While the output of generative response engines is often very good, that quality is the result of a very good answer being probable given the input. Outputs that are often very good are not sufficient in some circumstances. As an example, generative response engines are used to generate and develop code (e.g., executable code). A user account can provide a prompt requesting that the generative response engine output a segment of code in a particular programming language or syntax. Thus, generative response engines can, in theory, be used to efficiently write and build executable program code in a number of different programming languages. While the output can often be very good, it can be difficult to ensure that a generative response engine output matches the desired structured format. For example, the coding language might require that a block of code be encapsulated by opening and closing brackets. To the extent that the generative response engine provides the opening and closing brackets, this is generally a probabilistic result that occurs since much of the code it has been trained on includes