US-12619428-B1 - Annotations for developers
Abstract
Techniques are described herein for translating source code in one programming language to source code in another programming language using machine learning. A method includes: receiving first source code in a first higher-level programming language; processing the first source code, or an intermediate representation thereof, using a sequence-to-sequence neural network model to generate a sequence of outputs, each including a probability distribution; generating second source code in a second higher-level programming language by, for each output in the sequence of outputs: determining a highest probability in the probability distribution associated with the output; in response to the highest probability exceeding a first threshold, generating a predicted portion of the second source code based on a token that corresponds to the highest probability; and in response to the highest probability not exceeding the first threshold, generating a placeholder; and outputting the second source code.
Inventors
- Rishabh Singh
- Artem Goncharuk
- Karen Davis
- David Andre
Assignees
- GOOGLE LLC
Dates
- Publication Date: 2026-05-05
- Application Date: 2023-09-28
Claims (20)
- 1 . A method implemented by one or more processors, the method comprising: receiving first source code in a first higher-level programming language; processing the first source code, or an intermediate representation of the first source code, using a neural network model to generate a sequence of outputs, wherein each output in the sequence of outputs comprises a probability distribution; generating second source code in a second higher-level programming language by, for each output in the sequence of outputs, generating a predicted portion of the second source code based on the probability distribution, wherein in response to a highest probability in the probability distribution not exceeding a threshold, the predicted portion of the second source code is determined to be a low-confidence translation; outputting the second source code; for at least one of the predicted portions of the second source code, receiving confirmation that the predicted portion of the second source code is a correct translation of a corresponding portion of the first source code; and retraining the neural network model by using in a feedback loop the confirmation that the predicted portion of the second source code is the correct translation.
- 2 . The method according to claim 1 , further comprising compiling the second source code using a compiler for the second higher-level programming language to generate a compiled representation of the second source code.
- 3 . The method according to claim 1 , further comprising: for an additional one of the predicted portions of the second source code, receiving a rejection indicating that the additional one of the predicted portions of the second source code is an incorrect translation of a corresponding portion of the first source code; and retraining the neural network model by using in a feedback loop the rejection indicating that the additional one of the predicted portions of the second source code is the incorrect translation.
- 4 . The method according to claim 1 , further comprising: for an additional one of the predicted portions of the second source code, receiving a rejection indicating that the additional one of the predicted portions of the second source code is an incorrect translation of a corresponding portion of the first source code, and receiving a replacement portion of the second source code; replacing the additional one of the predicted portions of the second source code with the replacement portion of the second source code; and retraining the neural network model by using in a feedback loop the rejection indicating that the additional one of the predicted portions of the second source code is the incorrect translation and the replacement portion of the second source code.
- 5 . The method according to claim 1 , wherein the processing is performed on the intermediate representation of the first source code, and further comprising generating the intermediate representation of the first source code by compiling the first source code using a compiler for the first higher-level programming language.
- 6 . The method according to claim 1 , further comprising, for each of the predicted portions of the second source code that are determined to be the low-confidence translation, providing, in a user interface, a visual indication that the predicted portion of the second source code is the low-confidence translation.
- 7 . The method according to claim 6 , wherein the visual indication is based on a confidence level that the predicted portion of the second source code is a correct translation of a corresponding portion of the first source code.
- 8 . A computer program product comprising one or more non-transitory computer-readable storage media having program instructions collectively stored on the one or more non-transitory computer-readable storage media, the program instructions executable to: receive first source code in a first higher-level programming language; process the first source code, or an intermediate representation of the first source code, using a neural network model to generate a sequence of outputs, wherein each output in the sequence of outputs comprises a probability distribution; generate second source code in a second higher-level programming language by, for each output in the sequence of outputs, generating a predicted portion of the second source code based on the probability distribution, wherein in response to a highest probability in the probability distribution not exceeding a threshold, the predicted portion of the second source code is determined to be a low-confidence translation; output the second source code; for at least one of the predicted portions of the second source code, receive confirmation that the predicted portion of the second source code is a correct translation of a corresponding portion of the first source code; and retrain the neural network model by using in a feedback loop the confirmation that the predicted portion of the second source code is the correct translation.
- 9 . The computer program product according to claim 8 , wherein the program instructions are further executable to compile the second source code using a compiler for the second higher-level programming language to generate a compiled representation of the second source code.
- 10 . The computer program product according to claim 8 , wherein the program instructions are further executable to: for an additional one of the predicted portions of the second source code, receive a rejection indicating that the additional one of the predicted portions of the second source code is an incorrect translation of a corresponding portion of the first source code; and retrain the neural network model by using in a feedback loop the rejection indicating that the additional one of the predicted portions of the second source code is the incorrect translation.
- 11 . The computer program product according to claim 8 , wherein the program instructions are further executable to: for an additional one of the predicted portions of the second source code, receive a rejection indicating that the additional one of the predicted portions of the second source code is an incorrect translation of a corresponding portion of the first source code, and receive a replacement portion of the second source code; replace the additional one of the predicted portions of the second source code with the replacement portion of the second source code; and retrain the neural network model by using in a feedback loop the rejection indicating that the additional one of the predicted portions of the second source code is the incorrect translation and the replacement portion of the second source code.
- 12 . The computer program product according to claim 8 , wherein the processing is performed on the intermediate representation of the first source code, and further comprising generating the intermediate representation of the first source code by compiling the first source code using a compiler for the first higher-level programming language.
- 13 . The computer program product according to claim 8 , wherein the program instructions are further executable to, for each of the predicted portions of the second source code that are determined to be the low-confidence translation, provide, in a user interface, a visual indication that the predicted portion of the second source code is the low-confidence translation.
- 14 . The computer program product according to claim 13 , wherein the visual indication is based on a confidence level that the predicted portion of the second source code is a correct translation of a corresponding portion of the first source code.
- 15 . A system comprising: a processor, a computer-readable memory, one or more non-transitory computer-readable storage media, and program instructions collectively stored on the one or more non-transitory computer-readable storage media, the program instructions executable to: receive first source code in a first higher-level programming language; process the first source code, or an intermediate representation of the first source code, using a neural network model to generate a sequence of outputs, wherein each output in the sequence of outputs comprises a probability distribution; generate second source code in a second higher-level programming language by, for each output in the sequence of outputs, generating a predicted portion of the second source code based on the probability distribution, wherein in response to a highest probability in the probability distribution not exceeding a threshold, the predicted portion of the second source code is determined to be a low-confidence translation; output the second source code; for at least one of the predicted portions of the second source code, receive confirmation that the predicted portion of the second source code is a correct translation of a corresponding portion of the first source code; and retrain the neural network model by using in a feedback loop the confirmation that the predicted portion of the second source code is the correct translation.
- 16 . The system according to claim 15 , wherein the program instructions are further executable to compile the second source code using a compiler for the second higher-level programming language to generate a compiled representation of the second source code.
- 17 . The system according to claim 15 , wherein the program instructions are further executable to: for an additional one of the predicted portions of the second source code, receive a rejection indicating that the additional one of the predicted portions of the second source code is an incorrect translation of a corresponding portion of the first source code; and retrain the neural network model by using in a feedback loop the rejection indicating that the additional one of the predicted portions of the second source code is the incorrect translation.
- 18 . The system according to claim 15 , wherein the program instructions are further executable to: for an additional one of the predicted portions of the second source code, receive a rejection indicating that the additional one of the predicted portions of the second source code is an incorrect translation of a corresponding portion of the first source code, and receive a replacement portion of the second source code; replace the additional one of the predicted portions of the second source code with the replacement portion of the second source code; and retrain the neural network model by using in a feedback loop the rejection indicating that the additional one of the predicted portions of the second source code is the incorrect translation and the replacement portion of the second source code.
- 19 . The system according to claim 15 , wherein the processing is performed on the intermediate representation of the first source code, and further comprising generating the intermediate representation of the first source code by compiling the first source code using a compiler for the first higher-level programming language.
- 20 . The system according to claim 15 , wherein the program instructions are further executable to, for each of the predicted portions of the second source code that are determined to be the low-confidence translation, provide, in a user interface, a visual indication that the predicted portion of the second source code is the low-confidence translation.
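The accept/reject feedback loop recited in claims 1, 3, and 4 can be sketched as follows. This is a hypothetical illustration under assumed names (`FeedbackCollector`, `confirm`, `reject`): developer confirmations, rejections, and replacement snippets are accumulated as supervised examples for retraining the translation model; no specific training framework is implied.

```python
# Illustrative sketch of collecting human feedback on predicted
# translations for use in a retraining loop.
class FeedbackCollector:
    def __init__(self):
        # (base_portion, target_portion) pairs usable as training data.
        self.examples = []

    def confirm(self, base_portion, predicted_portion):
        # Claim 1: a confirmed prediction becomes a positive training pair.
        self.examples.append((base_portion, predicted_portion))

    def reject(self, base_portion, predicted_portion, replacement=None):
        # Claim 4: if the developer supplies a corrected translation,
        # the replacement becomes the ground truth for this portion.
        if replacement is not None:
            self.examples.append((base_portion, replacement))
        # Claim 3: a bare rejection could instead be logged as a negative
        # example, depending on the training objective chosen.
```

Periodically, the accumulated `examples` would be fed back into training of the neural network model, closing the loop described in the claims.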
Description
BACKGROUND
Computer software programming often requires developers to read and/or write source code (i.e., to program) in a specific higher-level programming language. Some non-limiting examples of higher-level programming languages include Java, C++, C, Python, Perl, etc., each of which can have its own strengths, weaknesses, nuances, and idiosyncrasies. Many programmers obtain at least a superficial understanding of multiple programming languages but master only a few. Consequently, problems can arise when an entity (e.g., a company) wants to translate code from a base higher-level programming language to a different target higher-level programming language. For example, existing programmers at the entity may lack expertise in the target programming language and be unable to manually translate the code, or may be highly inefficient in doing so. These inefficiencies can lead to excess usage of client device resources utilized in translating the code. Put another way, the inefficiencies can result in a client device that is being used to manually translate the code remaining on and/or in a higher-powered state for prolonged periods. Even if new programmer(s) familiar with the target language were brought in for the translating, manual translation is nonetheless inefficient, at least because the new programmer(s) are unfamiliar with the semantics of the base code being translated. Even outside of the automatic translating context, excess usage of client device resources can also occur when programmers attempt to code in a new language in which they have less expertise relative to other language(s). This can be because the programmers are slower when coding in the new language, which, in turn, prolongs the duration that client device resource(s) need to be active.
SUMMARY
Implementations disclosed herein relate to utilizing machine learning model(s) to automatically translate source code in a "base" programming language to source code in another, "target," programming language. The machine learning models used to translate source code may include, e.g., neural network models, neural network ensembles, and model pipelines (e.g., a pipeline from first source code, to a first embedding, to a second embedding, to second source code). Implementations disclosed herein can enable automatic full or partial translation of source code from the base programming language to the target programming language while mitigating the amount of programmer time (and corresponding client device usage) involved. For example, some or all source code of a program can be translated from the base language to the target language without requiring any human intervention. For instance, translated target language source code can optionally be presented to a programmer for review and potential editing, but the programmer is not involved in the initial generation of the translated target language segment. Implementations disclosed herein can additionally or alternatively enable programmers who might be unfamiliar with a base programming language to nonetheless view and/or edit source code written in the base language, by translating the source code to another programming language that is more familiar to the programmer. In automatically translating source code of a program programmed in a base programming language (e.g., C++), an intermediate representation can be generated. In some implementations, the intermediate representation can be a lower-level representation generated using a compiler for the base programming language that produces a lower-level compiled representation of the source code.
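One concrete way to produce such a lower-level intermediate representation is to run the base-language compiler and capture its output. As a hedged sketch only, the example below uses CPython's built-in `compile` and the standard-library `dis` module to turn Python base source into a bytecode disassembly; the patent's approach is not limited to Python or to bytecode.

```python
# Illustrative sketch: compile base-language source to a lower-level
# intermediate representation (here, CPython bytecode disassembly text)
# that could be fed to a translation model.
import dis
import io

def to_bytecode_ir(base_source):
    """Compile Python source and return its bytecode disassembly as text."""
    code_obj = compile(base_source, "<base>", "exec")
    buf = io.StringIO()
    dis.dis(code_obj, file=buf)
    return buf.getvalue()
```

The resulting text (e.g., opcodes such as `LOAD_CONST` and `STORE_NAME`) is less human-readable than the original source, matching the "lower-level representation" described above.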
As used herein, the “lower-level representation” can refer to bytecode, object code, binary code, assembly code, abstract syntax trees, or any other representation of source code that is less human-readable than source code from which the lower-level representation was generated. In other implementations, the intermediate representation can be a natural language intermediate representation. For example, a machine learning model can be used to translate the source code programmed in the base programming language to a natural language intermediate representation. For instance, the machine learning model can be trained based on training instances with base source code, natural language pairs. The natural language paired with a corresponding instance of source code in a training instance can be, for example, natural language that conforms to docstring(s) for the instance of source code or is based on such docstring(s) (e.g., a variant that omits and/or replace(s) term(s) of the docstring(s)). In other implementations, the natural language paired with a corresponding instance of source code in a training instance can be, for example, comments in the source code and/or other types of documentation, e.g., comments to changes or commits in a source control system or versio
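The construction of (source code, natural language) training instances from docstrings, described above, can be sketched as follows. This is a hypothetical illustration using Python's standard `ast` module to pair each function with its docstring; the function name `docstring_pairs` is an assumption, and other extraction strategies (e.g., mining commit messages) would work analogously.

```python
# Illustrative sketch: build (function source, docstring) training pairs
# from a Python module, for training a source-code-to-natural-language model.
import ast

def docstring_pairs(module_source):
    """Yield (function_source, docstring) pairs from a Python module."""
    tree = ast.parse(module_source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            doc = ast.get_docstring(node)
            if doc:
                # ast.unparse (Python 3.9+) recovers source for the function.
                yield (ast.unparse(node), doc)
```

Each yielded pair is one training instance of the kind described: an instance of base source code paired with natural language that conforms to its docstring.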