US-12619819-B1 - Identifying provenance information of a data item generated by a generative machine learning model
Abstract
Metadata may be identified for text generated by a generative machine learning model. A text is obtained and a weighting scheme is determined for performing similarity analysis. Different similarity analysis techniques are performed that compare the text with representations of texts in the training data set for the generative machine learning model. Final similarity scores are generated that combine the results of the different similarity analysis techniques according to the weighting scheme and are used to select metadata to provide that is relevant to the text.
Inventors
- Jiangtao ZHANG
- Ramu Panayappan
- Mark Fawaz
- Vijay Dheeraj Reddy Mandadi
- Sreenaath Vasudevan
- Raviprasad V Mummidi
Assignees
- AMAZON TECHNOLOGIES, INC.
Dates
- Publication Date: 2026-05-05
- Application Date: 2023-09-19
Claims (20)
- 1 . A system, comprising: at least one processor; and a memory, storing program instructions that when executed by the at least one processor, cause the at least one processor to implement a metadata identification system, configured to: receive, via an interface, a request to search for metadata relevant to code generated by a language model trained using a machine learning technique applied to training data comprising a plurality of code; identify a weighting scheme for performing similarity analysis with respect to the code generated by the language model; cause performance of a plurality of different similarity analysis techniques that compare the code with one or more representations of the plurality of code of the training data set; generate respective final similarity scores between the code and the one or more representations of the plurality of code according to the weighting scheme, wherein the weighting scheme indicates respective weights for combining individual similarity scores generated by the different similarity analysis techniques into the respective final similarity scores; select one representation of the one or more representations according to the respective final similarity scores between the code and the one or more representations of the plurality of code; and return, via the interface, metadata corresponding to the selected one representation as descriptive of the code generated by the language model.
- 2 . The system of claim 1 , wherein the metadata identification system is configured to: receive, via the interface, feedback regarding the returned metadata for the code; and update the weighting scheme according to the received feedback.
- 3 . The system of claim 1 , wherein the metadata identification system is configured to: receive one or more similarity parameters via the interface to update the search; update the weighting scheme according to the one or more similarity parameters; generate new respective final similarity scores between the code and the one or more representations of the plurality of code according to the updated weighting scheme; select the one representation or another one of the one or more representations according to the respective new final similarity scores between the code and the one or more representations of the plurality of code; and return, via the interface, the metadata corresponding to the selected one representation or further metadata corresponding to the other one representation as descriptive of the code generated by the language model.
- 4 . The system of claim 1 , wherein the metadata identification system is implemented as part of a code development service of a provider network, wherein the code was generated to perform a refactoring task for an input code provided to the code development service.
- 5 . A method, comprising: obtaining, at a metadata identification system, a data item generated by a generative machine learning model trained using a machine learning technique applied to training data comprising a plurality of data items; determining, by the metadata identification system, a weighting scheme for performing similarity analysis with respect to the data item generated by the generative machine learning model; performing, by the metadata identification system, a plurality of different similarity analysis techniques that compare the data item with one or more representations of the plurality of data items of the training data set; generating, by the metadata identification system, respective final similarity scores between the data item and the one or more representations of the plurality of data items according to the weighting scheme, wherein the weighting scheme indicates respective weights for combining individual similarity scores generated by the different similarity analysis techniques into the respective final similarity scores; selecting, by the metadata identification system, one representation of the one or more representations according to the respective final similarity scores between the data item and the one or more representations of the plurality of data items; and providing, by the metadata identification system, metadata corresponding to the selected one representation as descriptive of the data item generated by the generative machine learning model.
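The flow of claim 5 can be illustrated with a minimal sketch: per-technique similarity scores are combined into final scores according to a weighting scheme, and the highest-scoring training-set representation supplies the metadata. All names and the toy technique used in the example are illustrative assumptions, not the patented implementation.

```python
# Illustrative sketch of the claimed flow: combine scores from several
# similarity techniques per a weighting scheme, then return metadata for
# the highest-scoring training-set representation. Names are hypothetical.

def final_scores(item, representations, techniques, weights):
    """Weighted combination of per-technique similarity scores."""
    scores = {}
    for rep_id, rep in representations.items():
        scores[rep_id] = sum(
            weights[name] * technique(item, rep)
            for name, technique in techniques.items()
        )
    return scores

def select_metadata(item, representations, metadata, techniques, weights):
    """Pick the representation with the highest final score."""
    scores = final_scores(item, representations, techniques, weights)
    best = max(scores, key=scores.get)
    return metadata[best], scores[best]
```

In use, `techniques` would map technique names (token-based, semantic, structure-based, etc.) to scoring functions, and `weights` would hold the weighting scheme the claim describes.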
- 6 . The method of claim 5 , further comprising: receiving, at the metadata identification system, feedback regarding the provided metadata for the data item; and updating the weighting scheme according to the received feedback.
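The feedback loop of claim 6 could take many forms (the description mentions reinforcement learning in FIGS. 6A-6B); one hypothetical update rule, sketched purely for illustration, shifts weight toward techniques that scored the selected representation highly when feedback was positive:

```python
def update_weights(weights, technique_scores, reward, lr=0.1):
    """Shift weight toward techniques that scored high when feedback was
    positive (reward > 0), away when negative; renormalize to sum to 1.
    A hypothetical update rule, not the patented one."""
    adjusted = {
        name: max(1e-6, w + lr * reward * technique_scores[name])
        for name, w in weights.items()
    }
    total = sum(adjusted.values())
    return {name: w / total for name, w in adjusted.items()}
```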
- 7 . The method of claim 5 , further comprising: receiving one or more similarity parameters; updating the weighting scheme according to the one or more similarity parameters; generating new respective final similarity scores between the data item and the one or more representations of the plurality of data items according to the updated weighting scheme; selecting the one representation or another one of the one or more representations according to the respective new final similarity scores between the data item and the one or more representations of the plurality of data items; and providing the metadata corresponding to the selected one representation or further metadata corresponding to the other one representation as descriptive of the data item generated by the generative machine learning model.
- 8 . The method of claim 5 , wherein the weighting scheme is determined based, at least in part, on one or more similarity parameters received at the metadata identification system for performing a similarity search for relevant metadata for the data item.
- 9 . The method of claim 5 , wherein the respective final similarity scores are generated according to a weighted average of the similarities.
- 10 . The method of claim 5 , wherein one of the different similarity techniques is a token-based similarity technique that generates tokens of the data item for comparison with token representations of the plurality of data items.
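A common token-based measure of this kind is Jaccard similarity over token sets. The sketch below tokenizes on non-identifier characters; the claim does not specify the actual tokenization, so this is an assumed example:

```python
import re

def tokenize(code):
    # Split on non-identifier characters; lowercase for comparison.
    return {t.lower() for t in re.split(r"\W+", code) if t}

def token_similarity(a, b):
    """Jaccard similarity over token sets: |A & B| / |A | B|."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)
```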
- 11 . The method of claim 5 , wherein one of the different similarity techniques is a semantic similarity technique that generates an embedding of text for comparison with embeddings of the plurality of data items.
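Embedding comparisons of this kind are typically scored with cosine similarity. A self-contained sketch follows; the embedding model that produces the vectors is out of scope here:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors.

    1.0 means the vectors point the same direction (semantically closest
    under this measure); 0.0 means they are orthogonal.
    """
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```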
- 12 . The method of claim 5 , wherein further metadata for another one of the one or more representations of the plurality of data items is provided based on the respective final similarity scores, and wherein the metadata and the further metadata are ordered in a display for the data item according to the respective final similarity scores.
- 13 . The method of claim 5 , wherein the metadata identification system is implemented as part of a provider network service for text generated by the provider network service.
- 14 . One or more non-transitory, computer-readable storage media, storing program instructions that when executed on or across one or more computing devices cause the one or more computing devices to implement: receiving a request to search for metadata relevant to a text generated by a large language model (LLM) trained using a machine learning technique applied to training data comprising a plurality of texts; identifying a weighting scheme for performing similarity analysis with respect to the text generated by the LLM; causing performance of a plurality of different similarity analysis techniques that compare the text with one or more representations of the plurality of texts of the training data set; generating respective final similarity scores between the text and the one or more representations of the plurality of texts according to the weighting scheme, wherein the weighting scheme indicates respective weights for combining individual similarity scores generated by the different similarity analysis techniques into the respective final similarity scores; selecting one representation of the one or more representations according to the respective final similarity scores between the text and the one or more representations of the plurality of texts; and returning metadata corresponding to the selected one representation as descriptive of the text generated by the LLM.
- 15 . The one or more non-transitory, computer-readable storage media of claim 14 , storing further program instructions that when executed, cause the one or more computing devices to further implement: receiving feedback regarding the returned metadata for the text; and updating the weighting scheme according to the received feedback.
- 16 . The one or more non-transitory, computer-readable storage media of claim 14 , wherein one of the different similarity techniques is a structure-based similarity technique that generates graph structure of the text for comparison with graph structure representations of the plurality of texts.
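For code, one coarse structure-based comparison works over multisets of parse-tree node types. The sketch below uses Python's `ast` module as a stand-in for whatever graph representation the system actually builds; it is an assumption for illustration, not the patented technique:

```python
import ast
from collections import Counter

def node_type_profile(source):
    """Multiset of AST node types -- a coarse proxy for the code's
    graph structure (illustrative only)."""
    return Counter(type(n).__name__ for n in ast.walk(ast.parse(source)))

def structure_similarity(a, b):
    """Multiset overlap of node-type profiles, in [0, 1]."""
    pa, pb = node_type_profile(a), node_type_profile(b)
    overlap = sum((pa & pb).values())
    total = sum((pa | pb).values())
    return overlap / total if total else 1.0
```

Note that this measure ignores identifier names entirely: two snippets with the same shape but different variable names score 1.0, which is exactly the kind of signal a token-based technique would miss.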
- 17 . The one or more non-transitory, computer-readable storage media of claim 14 , wherein further metadata for another one of the one or more representations of the plurality of texts is provided based on the respective final similarity scores, and wherein the metadata and the further metadata are refined according to one or more relevancy parameters.
- 18 . The one or more non-transitory, computer-readable storage media of claim 14 , wherein the weighting scheme is determined based, at least in part, on one or more similarity parameters received at a metadata identification system for performing a similarity search for relevant metadata for the text.
- 19 . The one or more non-transitory, computer-readable storage media of claim 14 , wherein the text is code and wherein one of the similarity techniques is a version control similarity technique that compares a summary generated of the code with descriptions of committed code changes.
- 20 . The one or more non-transitory, computer-readable storage media of claim 14 , wherein the one or more computing devices are implemented as part of a code development service of a provider network and wherein the text is code generated by the code development service.
Description
BACKGROUND
Large language models (LLMs) and other generative machine learning models expand the capabilities of different systems to interact with and respond to text and other data items across a wide variety of subjects. For instance, to provide competency across a number of subjects, LLMs are trained using large amounts of text data. Accordingly, text generated by LLMs may draw upon many different sources.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a logical block diagram illustrating identifying metadata descriptive of a data item generated by a generative machine learning model, according to some embodiments.
FIG. 2 is a logical block diagram illustrating a provider network that implements different services including a code development service that may implement identifying metadata descriptive of code generated by a large language model, according to some embodiments.
FIG. 3 is a logical block diagram illustrating interactions to request metadata for generated code, according to some embodiments.
FIG. 4 is a logical block diagram illustrating generated code metadata identification, according to some embodiments.
FIG. 5 illustrates an example of a weighted average indexing scheme, according to some embodiments.
FIGS. 6A-6B are logical block diagrams illustrating different scenarios for reinforcement learning to train weighting schemes, in some embodiments.
FIGS. 7A-7B are example user interfaces for providing identified metadata for generated code, according to some embodiments.
FIG. 8 is a high-level flowchart illustrating techniques and methods to implement identifying metadata descriptive of a data item generated by a generative machine learning model, according to some embodiments.
FIG. 9 is a high-level flowchart illustrating techniques and methods to implement determining a weighting scheme for similarity analyses, according to some embodiments.
FIG. 10 is a block diagram illustrating an example computing system, according to some embodiments.
While embodiments are described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that the embodiments are not limited to the embodiments or drawings described. It should be understood that the drawings and detailed description thereto are not intended to limit embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include”, “including”, and “includes” mean including, but not limited to.
DETAILED DESCRIPTION
Various techniques for identifying provenance information of a data item generated by a generative machine learning model are described herein. Generative machine learning models refer to machine learning techniques that model different types of data in order to perform various data generative tasks given a prompt. For example, language models, such as large language models (LLMs), are one type of generative machine learning model that refer to machine learning techniques applied to model language, which may include natural language (e.g., human speech) and machine-readable language (e.g., programming languages, scripts, code representations, etc.). A language model is a type of artificial intelligence (AI) model that is trained on textual data to generate coherent and contextually relevant text.
A “large” language model refers to a language model that has been trained on an extensive dataset and has a high number of parameters, enabling it to capture complex language patterns and perform a wider range of tasks. Large language models are designed to handle a wide range of natural language processing tasks, such as text completion, translation, summarization, and even conversation. The specific parameter count required for a model to be considered a “large” language model can vary depending on context and technological advancements. However, traditionally, large language models have millions to billions of parameters. Language models may take inputs of language prompts (potentially with additional relevant data) and generate corresponding language outputs. Language models are widely adaptable to many different language processing scenarios. For example, a language model can be trained to translate a given input text from one language to another. In another example, a language model could be trained to summarize, analyze, or perform other language processing tasks that generate output language based on given input language, such as chatting or following instructions. Some language models ca