US-12619406-B2 - Cloud native auto-labeling system to train code generation models
Abstract
Systems and methods are disclosed that deploy software code from a dataset into a computing environment. The systems and methods collect energy metrics of the software code while it executes in the computing environment. The systems and methods determine a sustainability label for the software code based on the energy metrics. The systems and methods assign the sustainability label to the software code to produce a sustainability-based dataset.
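The labeling idea in the abstract can be illustrated with a minimal sketch. Note that everything here is an assumption for illustration only: the `EnergyMetrics` fields, the actions-per-joule-per-second formula, and the `efficient`/`inefficient` label scheme are not taken from the patent.

```python
# Illustrative sketch (assumptions, not the patented implementation):
# attach a sustainability label to a code sample from measured energy metrics.
from dataclasses import dataclass

@dataclass
class EnergyMetrics:
    joules_consumed: float    # total energy used during the run (assumed field)
    actions_executed: int     # units of work completed (assumed field)
    duration_seconds: float   # wall-clock execution time (assumed field)

def sustainability_value(m: EnergyMetrics) -> float:
    """Actions per joule per second; higher means more sustainable (assumed formula)."""
    return m.actions_executed / (m.joules_consumed * m.duration_seconds)

def label_code(sample: dict, m: EnergyMetrics, threshold: float) -> dict:
    """Assign a sustainability label to one dataset sample (hypothetical labels)."""
    value = sustainability_value(m)
    sample["sustainability_value"] = value
    sample["sustainability_label"] = "efficient" if value >= threshold else "inefficient"
    return sample
```

Applied over a whole code dataset, such a labeling pass would yield the "sustainability-based dataset" the abstract describes, which could then serve as LLM training data.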
Inventors
- Huamin Chen
- Chen Wang
Assignees
- RED HAT, INC.
Dates
- Publication Date: 2026-05-05
- Application Date: 2023-08-24
Claims (20)
- 1. A method comprising: deploying software code from a dataset into a computing environment; collecting energy metrics of the software code while executing in the computing environment; determining, by a processing device, a sustainability label for the software code based on the energy metrics; assigning the sustainability label to the software code to produce a sustainability-based dataset; and training a large language model (LLM) using the sustainability-based dataset to produce a trained sustainability-based LLM.
- 2. The method of claim 1, wherein the software code is a new release of the software code and the sustainability label is a new sustainability label, the method further comprising: computing a new sustainability value of the new release of software code based on the energy metrics, wherein the energy metrics indicate a performance per unit of time of the new release of software code while executing in the computing environment; and determining the new sustainability label of the new release of software code based on comparing the new sustainability value with one or more previous sustainability values corresponding to one or more previous releases of the software code.
- 3. The method of claim 2, further comprising: adding a new table entry comprising the new sustainability label and corresponding to the new release of software code into a sustainability table, wherein the sustainability table further comprises one or more previous table entries corresponding to the one or more previous releases of the software code and their corresponding one or more previous sustainability labels.
- 4. The method of claim 2, further comprising: identifying a number of actions executed by the new release of the software code that consume a joule of energy per unit of time; and using the identified number of actions to compute the new sustainability value.
- 5. The method of claim 2, further comprising: ranking the new release of software code with the one or more previous releases of the software code based on their corresponding sustainability values; assigning the new sustainability label to the new release of software code based on the ranking; and assigning one or more updated sustainability labels to the one or more previous releases of software code based on the ranking.
- 6. The method of claim 1, wherein, prior to deploying the software code, the method further comprises: extracting code collection information from a playbook, wherein the playbook comprises a scripted automation configuration that outlines one or more tasks to be executed; and identifying the software code based on at least one of the one or more tasks.
- 7. The method of claim 6, wherein, prior to deploying the software code, a plurality of approved playbooks comprising the playbook are utilized by the LLM.
- 8. A system comprising: a processing device; and a memory to store instructions that, when executed by the processing device, cause the processing device to: deploy software code from a dataset into a computing environment; collect energy metrics of the software code while executing in the computing environment; determine a sustainability label for the software code based on the energy metrics; assign the sustainability label to the software code to produce a sustainability-based dataset; and train a large language model (LLM) using the sustainability-based dataset to produce a trained sustainability-based LLM.
- 9. The system of claim 8, wherein the software code is a new release of the software code and the sustainability label is a new sustainability label, and wherein the processing device, responsive to executing the instructions, further causes the system to: compute a new sustainability value of the new release of software code based on the energy metrics, wherein the energy metrics indicate a performance per unit of time of the new release of software code while executing in the computing environment; and determine the sustainability label of the new release of software code based on comparing the new sustainability value with one or more previous sustainability values corresponding to one or more previous releases of the software code.
- 10. The system of claim 9, wherein the processing device, responsive to executing the instructions, further causes the system to: add a new table entry comprising the new sustainability label and corresponding to the new release of software code into a sustainability table, wherein the sustainability table further comprises one or more previous table entries corresponding to the one or more previous releases of the software code and their corresponding one or more previous sustainability labels.
- 11. The system of claim 9, wherein the processing device, responsive to executing the instructions, further causes the system to: identify a number of actions executed by the new release of the software code that consume a joule of energy per unit of time; and use the identified number of actions to compute the new sustainability value.
- 12. The system of claim 9, wherein the processing device, responsive to executing the instructions, further causes the system to: rank the new release of software code with the one or more previous releases of the software code based on their corresponding sustainability values; assign the new sustainability label to the new release of software code based on the ranking; and assign one or more updated sustainability labels to the one or more previous releases of software code based on the ranking.
- 13. The system of claim 8, wherein, prior to deploying the software code, the processing device, responsive to executing the instructions, further causes the system to: extract code collection information from a playbook, wherein the playbook comprises a scripted automation configuration that outlines one or more tasks to be executed; and identify the software code based on at least one of the one or more tasks.
- 14. The system of claim 13, wherein, prior to deploying the software code, a plurality of approved playbooks comprising the playbook are utilized by the LLM.
- 15. A non-transitory computer readable medium, having instructions stored thereon which, when executed by a processing device, cause the processing device to: deploy software code from a dataset into a computing environment; collect energy metrics of the software code while executing in the computing environment; determine, by the processing device, a sustainability label for the software code based on the energy metrics; assign the sustainability label to the software code to produce a sustainability-based dataset; and train a large language model (LLM) using the sustainability-based dataset to produce a trained sustainability-based LLM.
- 16. The non-transitory computer readable medium of claim 15, wherein the software code is a new release of the software code and the sustainability label is a new sustainability label, and wherein the processing device is configured to: compute a new sustainability value of the new release of software code based on the energy metrics, wherein the energy metrics indicate a performance per unit of time of the new release of software code while executing in the computing environment; and determine the sustainability label of the new release of software code based on comparing the new sustainability value with one or more previous sustainability values corresponding to one or more previous releases of the software code.
- 17. The non-transitory computer readable medium of claim 16, wherein the processing device is configured to: add a new table entry comprising the new sustainability label and corresponding to the new release of software code into a sustainability table, wherein the sustainability table further comprises one or more previous table entries corresponding to the one or more previous releases of the software code and their corresponding one or more previous sustainability labels.
- 18. The non-transitory computer readable medium of claim 16, wherein the processing device is configured to: identify a number of actions executed by the new release of the software code that consume a joule of energy per unit of time; and use the identified number of actions to compute the new sustainability value.
- 19. The non-transitory computer readable medium of claim 16, wherein the processing device is configured to: rank the new release of software code with the one or more previous releases of the software code based on their corresponding sustainability values; assign the new sustainability label to the new release of software code based on the ranking; and assign one or more updated sustainability labels to the one or more previous releases of software code based on the ranking.
- 20. The non-transitory computer readable medium of claim 15, wherein, prior to deploying the software code, the processing device is configured to: extract code collection information from a playbook, wherein the playbook comprises a scripted automation configuration that outlines one or more tasks to be executed; and identify the software code based on at least one of the one or more tasks.
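Claims 2 and 5 (and their system and medium counterparts) describe ranking a new release against previous releases by sustainability value and relabeling every release from that ranking. The following is a minimal, hypothetical sketch of that idea only; the `rank-N` label scheme, function names, and input values are assumptions, not the claimed implementation.

```python
# Hypothetical sketch of ranking-based relabeling (claims 2 and 5):
# every release, old and new, gets a label derived from its rank.
def rank_and_label(releases: dict) -> dict:
    """Map release name -> sustainability value to release name -> label.

    Higher sustainability values rank better; the best release is
    labeled 'rank-1' (assumed labeling convention).
    """
    ordered = sorted(releases.items(), key=lambda kv: kv[1], reverse=True)
    return {name: f"rank-{i}" for i, (name, _) in enumerate(ordered, start=1)}

# Hypothetical history table: previous releases plus a new release v2.0.
history = {"v1.0": 3.2, "v1.1": 4.8, "v2.0": 4.1}
labels = rank_and_label(history)
# v1.1 ranks first, v2.0 second, v1.0 third; earlier labels are updated too.
```

Because the whole history is re-ranked on each new release, previously assigned labels are refreshed as claim 5 requires, and the resulting table entries correspond to the sustainability table of claim 3.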
Description
TECHNICAL FIELD
Aspects of the present disclosure relate to dataset labeling, and more particularly, to an approach of auto-labeling software code to produce a sustainability-based dataset.
BACKGROUND
Large language models (LLMs) employ advanced neural network architectures to understand, generate, and manipulate human language with a high degree of proficiency. LLMs are trained on datasets, which are collections of structured or unstructured data that provide the examples from which the models learn the language patterns, semantics, and contextual relationships necessary for generating coherent and contextually relevant responses. In addition to text, images, and other forms of data, datasets can also encompass software code snippets, enhancing the capability of LLMs to understand and generate programming-related content.
BRIEF DESCRIPTION OF THE DRAWINGS
The described embodiments and the advantages thereof may best be understood by reference to the following description taken in conjunction with the accompanying drawings. These drawings in no way limit any changes in form and detail that may be made to the described embodiments by one skilled in the art without departing from the spirit and scope of the described embodiments.
- FIG. 1 is a block diagram that illustrates an example system, in accordance with some embodiments of the present disclosure.
- FIG. 2 is a flow diagram of a method for collecting energy metrics of software code, determining a sustainability label for the software code, and assigning the sustainability label to the software code, in accordance with some embodiments of the present disclosure.
- FIG. 3A is a block diagram that illustrates an example history table for tracking software code releases and their corresponding sustainability information, in accordance with some embodiments of the present disclosure.
- FIG. 3B is a block diagram that illustrates an example sustainability-based dataset table that includes software code releases and their corresponding sustainability labels, in accordance with some embodiments of the present disclosure.
- FIG. 4 is a flow diagram of a method for assigning a sustainability label to software code to produce a sustainability-based dataset, in accordance with some embodiments of the present disclosure.
- FIG. 5 is a block diagram that illustrates an example system for producing a sustainability-based dataset based on collected software code energy metrics, in accordance with some embodiments of the present disclosure.
- FIG. 6 is a flow diagram of a method for assigning sustainability labels to software code releases based on sustainability information corresponding to previous releases of the software code, in accordance with some embodiments of the present disclosure.
- FIG. 7 is a block diagram that illustrates an approach of ranking software code releases based on sustainability information relative to their corresponding previous releases of software code, in accordance with some embodiments of the present disclosure.
- FIG. 8 is a block diagram that illustrates an example system for ranking software code releases based on sustainability information relative to their corresponding previous releases of software code, in accordance with some embodiments of the present disclosure.
- FIG. 9 is a block diagram of an example computing device that may perform one or more of the operations described herein, in accordance with some embodiments of the present disclosure.
DETAILED DESCRIPTION
In the rapidly evolving landscape of artificial intelligence and natural language processing, large language models (LLMs) have emerged as pivotal tools for understanding and generating human-like text. A key factor driving the effectiveness of LLMs is their ability to leverage expansive and diverse datasets during both the training and inference stages.
As discussed above, LLMs are trained using extensive datasets that encompass a wide array of data types. These datasets serve as instructional material that allows an LLM to learn the intricate linguistic patterns, grammatical structures, and contextual relationships present in human language. The training process involves iteratively presenting the LLM with examples from the dataset and adjusting its internal parameters to minimize the disparity between its predictions and the actual text. This enables the LLM to acquire a nuanced understanding of language semantics, enhancing its capacity to generate coherent and contextually appropriate responses.
One notable capability of modern LLMs is code generation: producing responses that include suggestions for software code. By integrating programming code snippets within their training datasets, these models can offer insightful and contextually relevant code suggestions in response to user queries. For example, a query re