
JP-7855727-B2 - System and Method for Program Synthesis

JP 7855727 B2

Inventors

  • Le, Hung
  • Wang, Yue
  • Gotmare, Akhilesh Deepak
  • Hoi, Chu Hong

Assignees

  • Salesforce, Inc.

Dates

Publication Date
2026-05-08
Application Date
2023-05-19
Priority Date
2022-08-26

Claims (19)

  1. A method, performed by a processor, of reinforcement learning-based training for program synthesis based on a reinforcement learning framework, the method comprising: receiving, via an input interface, a problem specification and a corresponding solution program; fine-tuning a pre-trained language model based on the problem specification and the corresponding solution program; generating, by the fine-tuned pre-trained language model, a sampled program in response to the problem specification over decoding time steps; generating, by a critic model, a return indicating functional correctness of the sampled program based on a comparison of execution results of the sampled program with test results of the problem specification; computing a policy gradient of an expected value of the return given current parameters of the fine-tuned pre-trained language model; and updating the fine-tuned pre-trained language model according to the policy gradient.
  2. The method of claim 1, wherein the sampled program is generated by: generating, at a decoding time step, a predicted token for the sampled program governed by the current parameters of the fine-tuned pre-trained language model; updating a hidden state representation of the fine-tuned pre-trained language model; and generating, at a next decoding time step, a next predicted token of the sampled program using the updated hidden state representation of the fine-tuned pre-trained language model.
  3. The method of claim 2, wherein the return is generated when an end token is generated for the sampled program.
  4. The method of claim 1, wherein the return is generated by: passing the sampled program and tests to a compiler; and determining a first return value depending on whether the sampled program is successfully compiled and executed and whether the execution results match the test results of the problem specification.
  5. The method of claim 4, wherein the policy gradient is computed as an estimate based on the return and a gradient of conditional probabilities of the predicted tokens conditioned on previously predicted tokens and the problem specification.
  6. The method of claim 4, further comprising: inputting, into the critic model, a baseline program generated by a base model in response to the problem specification; and determining a second return value depending on whether the baseline program is successfully compiled and executed and whether its execution results match the test results of the problem specification.
  7. The method of claim 6, wherein the policy gradient is computed based on a difference between the first return value and the second return value and a gradient of conditional probabilities of the predicted tokens conditioned on previously predicted tokens and the problem specification.
  8. The method of claim 1, wherein the critic model is trained by: receiving, at the critic model, a training sequence of the problem specification and the sampled program; generating, by the critic model, a predicted test outcome corresponding to the sampled program; computing a cross-entropy loss by comparing the predicted test outcome with the execution results of the sampled program; and updating the critic model based on the cross-entropy loss.
  9. The method of claim 8, wherein the predicted test outcome is computed by a softmax operation over max-pooled contextual hidden states of a decoder in the critic model.
  10. The method of claim 9, wherein the policy gradient is computed based on a probability distribution of the predicted test outcome generated by the critic model and a gradient of conditional probabilities of the predicted tokens conditioned on previously predicted tokens and the problem specification.
  11. A system for reinforcement learning-based training for program synthesis based on a reinforcement learning framework, the system comprising: an input interface that receives a problem specification for a language model pre-trained for program synthesis; a memory storing a plurality of processor-executable instructions; and a processor that reads and executes the plurality of processor-executable instructions to perform operations comprising: receiving, via the input interface, the problem specification and a corresponding solution program; fine-tuning the pre-trained language model based on the problem specification and the corresponding solution program; generating, by the fine-tuned pre-trained language model, a sampled program in response to the problem specification over decoding time steps; generating, by a critic model, a return indicating functional correctness of the sampled program based on a comparison of execution results of the sampled program with test results of the problem specification; computing a policy gradient of an expected value of the return given current parameters of the fine-tuned pre-trained language model; and updating the fine-tuned pre-trained language model according to the policy gradient.
  12. The system of claim 11, wherein the sampled program is generated by: generating, at a decoding time step, a predicted token for the sampled program governed by the current parameters of the fine-tuned pre-trained language model; updating a hidden state representation of the fine-tuned pre-trained language model; and generating, at a next decoding time step, a next predicted token of the sampled program using the updated hidden state representation of the fine-tuned pre-trained language model.
  13. The system of claim 12, wherein the return is generated when an end token is generated for the sampled program.
  14. The system of claim 11, wherein the return is generated by: passing the sampled program and tests to a compiler; and determining a first return value depending on whether the sampled program is successfully compiled and executed and whether the execution results match the test results of the problem specification.
  15. The system of claim 14, wherein the policy gradient is computed as an estimate based on the return and a gradient of conditional probabilities of the predicted tokens conditioned on previously predicted tokens and the problem specification.
  16. The system of claim 14, wherein the operations further comprise: inputting, into the critic model, a baseline program generated by a base model in response to the problem specification; and determining a second return value depending on whether the baseline program is successfully compiled and executed and whether its execution results match the test results of the problem specification.
  17. The system of claim 16, wherein the policy gradient is computed based on a difference between the first return value and the second return value and a gradient of conditional probabilities of the predicted tokens conditioned on previously predicted tokens and the problem specification.
  18. The system of claim 11, wherein the critic model is trained by: receiving, at the critic model, a training sequence of the problem specification and the sampled program; generating, by the critic model, a predicted test outcome corresponding to the sampled program; computing a cross-entropy loss by comparing the predicted test outcome with the execution results of the sampled program; and updating the critic model based on the cross-entropy loss.
  19. The system of claim 18, wherein the predicted test outcome is computed by a softmax operation over max-pooled contextual hidden states of a decoder in the critic model, and wherein the policy gradient is computed based on a probability distribution of the predicted test outcome generated by the critic model and a gradient of conditional probabilities of the predicted tokens conditioned on previously predicted tokens and the problem specification.
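
Claims 1 and 4 to 7 together describe a REINFORCE-style update: the actor's expected return is differentiated via the score function, so that ∇θ E[r] ≈ (r − b) · ∇θ log pθ(program | specification), where r is the unit-test return of claim 4 and b is the baseline return of claims 6 and 7. The sketch below is a hypothetical Python/PyTorch illustration only, not the patented implementation: it assumes a Hugging Face-style causal language model and tokenizer, and a user-supplied run_unit_tests function that returns a scalar return (for example, +1 if the sampled program compiles and passes the problem's unit tests, −1 otherwise).

    import torch
    import torch.nn.functional as F

    def rl_step(model, tokenizer, optimizer, spec_ids, run_unit_tests,
                baseline_return=0.0, max_new_tokens=512):
        """One policy-gradient update on a single problem specification."""
        # 1. Sample a candidate program from the fine-tuned LM (the "actor").
        with torch.no_grad():
            sample = model.generate(spec_ids, do_sample=True,
                                    max_new_tokens=max_new_tokens)
        program = tokenizer.decode(sample[0, spec_ids.shape[1]:],
                                   skip_special_tokens=True)

        # 2. Compute the return by compiling and executing the sampled
        #    program against the problem's unit tests (claim 4).
        ret = run_unit_tests(program)  # e.g. +1.0 on pass, -1.0 on failure

        # 3. Re-score the sampled sequence to obtain differentiable
        #    log-probabilities of each generated token, conditioned on the
        #    previously predicted tokens and the specification (claim 5).
        logits = model(sample).logits[:, :-1, :]
        targets = sample[:, 1:]
        token_log_probs = F.log_softmax(logits, dim=-1).gather(
            -1, targets.unsqueeze(-1)).squeeze(-1)
        # Keep only the program tokens (those after the specification).
        prog_log_prob = token_log_probs[:, spec_ids.shape[1] - 1:].sum()

        # 4. REINFORCE estimate of the policy gradient, optionally relative
        #    to a baseline program's return (claims 6-7).
        loss = -(ret - baseline_return) * prog_log_prob
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return program, ret

Sampling terminates when the end token is generated, at which point the return becomes available (claim 3); in practice the update would be averaged over a batch of sampled programs.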

Description

[Cross reference] This international application claims priority to co-pending U.S. Non-Provisional Application Nos. 17/896,942 and 17/896,946, both filed on August 26, 2022 by the same applicant, each of which claims priority under 35 U.S.C. § 119 to U.S. Provisional Application No. 63/344,900, filed on May 23, 2022. All of the above applications are expressly incorporated herein by reference in their entirety.

[Technical field] The embodiments generally relate to machine learning systems, and more specifically to systems and methods for program synthesis through pre-trained models and deep reinforcement learning.

[Background] Program synthesis, also commonly referred to as code generation, is the task of generating a computer program that satisfies a problem specification, such as sorting a list, merging two data tables, and/or similar tasks. When program synthesis is treated as a sequence-to-sequence task, some pre-trained language models can be adapted to take a natural language problem specification as an input sequence and generate a sequence of code as an output program. However, these existing language models may have limited code generation performance because they often train the program synthesis model solely on natural language problem descriptions and ground-truth programs, following standard supervised fine-tuning procedures. This paradigm largely ignores important and potentially useful signals in the problem specification, such as unit tests, resulting in poor performance on complex, unfamiliar coding tasks. An efficient and accurate program synthesis model is therefore needed.

[Brief description of the drawings]
FIG. 1 is a simplified block diagram illustrating an exemplary architecture that employs an actor-critic framework to optimize a pre-trained language model (LM) (and a fine-tuned LM) for program synthesis, according to embodiments described herein.
FIG. 2 is a simplified block diagram illustrating an exemplary program synthesis task, according to one embodiment described herein.
FIG. 3 is a simplified block diagram illustrating an example of the reinforcement learning-based program synthesis framework applied to the exemplary program synthesis task, according to one embodiment described herein.
FIG. 4 is a simplified block diagram illustrating an exemplary training procedure for the critic network shown in FIG. 1, according to embodiments described herein.
FIG. 5 is a simplified block diagram showing a critic sampling (CS) framework for program synthesis at inference time, using the trained LM (actor network) from FIG. 1, according to embodiments described herein.
FIG. 6 is a simplified diagram showing a computing device for implementing the reinforcement learning-based program synthesis framework illustrated in FIGS. 1 to 5, according to several embodiments.
FIG. 7 is a simplified block diagram showing a networked system suitable for implementing the program synthesis framework described herein and other embodiments described herein.
FIG. 8 is an exemplary logic flow diagram illustrating a reinforcement learning-based training method for program synthesis based on the actor-critic framework shown in FIG. 1, according to some embodiments described herein.
FIG. 9 is an exemplary logic flow diagram, continued over two parts, illustrating a program synthesis method based on the LM shown in FIG. 5, according to some embodiments described herein.
The remaining figures are exemplary data tables or charts showing performance comparisons between the program synthesis framework described in FIGS. 1 to 9 and various baseline models, according to some embodiments described herein.
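
To make the critic of claims 8 and 9 (and 18 and 19) concrete, the following is a minimal sketch, again in Python/PyTorch and again hypothetical rather than the patented implementation. It assumes the critic's decoder exposes per-token contextual hidden states; these are max-pooled over the sequence, and the pooled vector is classified into discrete unit-test outcome categories with a softmax. Training minimizes the cross-entropy against the outcome actually observed when the sampled program is executed. The four outcome categories shown are an assumption (compile error, runtime error, failed tests, passed tests), and all module names are illustrative.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CriticHead(nn.Module):
        """Classifies a sampled program into unit-test outcome categories."""

        def __init__(self, hidden_size: int, num_outcomes: int = 4):
            super().__init__()
            # Hypothetical categories: compile error, runtime error,
            # failed unit tests, passed unit tests.
            self.classifier = nn.Linear(hidden_size, num_outcomes)

        def forward(self, decoder_hidden_states: torch.Tensor) -> torch.Tensor:
            # decoder_hidden_states: (batch, seq_len, hidden_size), the
            # contextual hidden states of the critic's decoder.
            # Max-pool over the sequence dimension, then apply a softmax
            # to obtain a distribution over test outcomes (claim 9).
            pooled = decoder_hidden_states.max(dim=1).values
            return F.log_softmax(self.classifier(pooled), dim=-1)

    def critic_loss(head: CriticHead,
                    decoder_hidden_states: torch.Tensor,
                    observed_outcome: torch.Tensor) -> torch.Tensor:
        # Cross-entropy between the predicted test outcome and the outcome
        # observed when executing the sampled program (claim 8).
        return F.nll_loss(head(decoder_hidden_states), observed_outcome)

Per claim 10, the probability distribution produced by the critic can then weight the token-level gradient terms of the actor's policy gradient, providing a denser training signal than a single sequence-level return.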