EP-4273724-B1 - DETECTING ANOMALIES IN CODE COMMITS


Inventors

DUGGAN, Neil David Jonathan
MARCOVECCHIO, Vincenzo Kazimierz

Dates

Publication Date: 20260513
Application Date: 20230328

Claims (11)

1. A method (200), comprising: obtaining (202), by a server, one or more attribute values associated with one or more code commits of source code, wherein the one or more attribute values comprise at least an expected reviewer associated with the one or more code commits; generating, by the server, a backdoor abstraction of the source code; and generating (204), by the server and based on the one or more attribute values and based on the backdoor abstraction of the source code, an anomaly report indicating a risk level of the source code.
2. The method (200) of claim 1, wherein the one or more attribute values further comprise at least one of: time of the one or more code commits; files modified by the one or more code commits; or a quantity of files modified by the one or more code commits during a configured period.
3. The method (200) of claim 1 or 2, wherein generating (204) the anomaly report comprises: inputting, by the server, the one or more attribute values into a machine learning model to determine whether the one or more code commits comprise an anomaly, wherein the machine learning model is trained using a plurality of samples comprising attribute values associated with code commits; and in response to determining that the one or more code commits comprise the anomaly, including the anomaly in the anomaly report.
4. The method (200) of any preceding claim, wherein the backdoor abstraction of the source code is an Extensible Markup Language or a JavaScript Object Notation file.
5. The method (200) of any preceding claim, wherein generating the backdoor abstraction comprises: identifying one or more library calls indicative of a potential backdoor in the source code; and including, in the backdoor abstraction, a potential backdoor representation corresponding to the one or more library calls.
6. The method (200) of any preceding claim, wherein generating (204) the anomaly report comprises: inputting, by the server, the one or more attribute values and the backdoor abstraction into a machine learning model to determine whether the source code comprises an anomaly, wherein the machine learning model is trained using a plurality of samples comprising sample backdoor abstractions and attribute values associated with code commits; and in response to determining that the source code comprises the anomaly, including the anomaly in the anomaly report.
7. The method (200) of any preceding claim, comprising: storing, by the server, the backdoor abstraction of the source code as a baseline; obtaining, by the server, additional source code; generating, by the server, an additional backdoor abstraction of the additional source code; and generating, by the server, an additional anomaly report based on the baseline and the additional backdoor abstraction.
8. The method (200) of claim 7, wherein the additional source code is a later version of the source code.
9. A computer-readable medium containing instructions which, when executed, cause a computing device to perform the method (200) of any preceding claim.
10. A computer-implemented system, comprising: one or more computers; and one or more computer memory devices interoperably coupled with the one or more computers and having tangible, non-transitory, machine-readable media storing one or more instructions that, when executed by the one or more computers, perform the method (200) of any one of claims 1 to 8.
11. A computer program which, when executed on one or more processors of one or more computing devices, is configured to cause the one or more computers to carry out the method (200) of any one of claims 1 to 8.
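
As an illustration of the backdoor abstraction and baseline comparison recited in the claims above, the following is a minimal sketch and not the claimed implementation. It assumes the committed source is Python, that the abstraction is a JSON-serializable list of findings, and that a small watch list of calls is enough to flag a potential backdoor. The names SUSPICIOUS_CALLS, call_name, backdoor_abstraction and new_findings, and the contents of the watch list, are hypothetical and do not come from the patent.

```python
import ast
import json

# Hypothetical watch list of calls that might indicate a backdoor.
# This list is an assumption for illustration, not taken from the patent.
SUSPICIOUS_CALLS = {"eval", "exec", "os.system", "subprocess.Popen", "socket.socket"}


def call_name(node: ast.Call) -> str:
    """Return a dotted name such as 'os.system' for a call node, or ''."""
    parts = []
    target = node.func
    while isinstance(target, ast.Attribute):
        parts.append(target.attr)
        target = target.value
    if isinstance(target, ast.Name):
        parts.append(target.id)
        return ".".join(reversed(parts))
    return ""


def backdoor_abstraction(source: str, filename: str = "<commit>") -> list[dict]:
    """List every watched library call found in the given source text."""
    findings = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if isinstance(node, ast.Call):
            name = call_name(node)
            if name in SUSPICIOUS_CALLS:
                findings.append({"file": filename, "call": name, "line": node.lineno})
    return sorted(findings, key=lambda f: (f["line"], f["call"]))


def new_findings(baseline: list[dict], later: list[dict]) -> list[dict]:
    """Return findings in the later abstraction that the baseline lacks."""
    known = {(f["file"], f["call"]) for f in baseline}
    return [f for f in later if (f["file"], f["call"]) not in known]


# A trusted version and a later version that adds a shell call.
baseline_src = "import os\nprint('hello')\n"
later_src = "import os\nprint('hello')\nos.system('nc -l 4444')\n"

baseline = backdoor_abstraction(baseline_src, "app.py")
later = backdoor_abstraction(later_src, "app.py")

# Only the difference from the baseline is reported, here as JSON.
print(json.dumps(new_findings(baseline, later), indent=2))
```

Comparing by file and call name rather than by line number keeps unrelated edits that shift lines from producing new findings. A real system would likely need a richer representation than this sketch.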

Description

TECHNICAL FIELD

The present disclosure relates to detecting anomalies in code commits.

BACKGROUND

In some cases, software services can be provided by compiling and executing source code. The source code is computer software in a human-readable programming language. The computer software can be an application software, a system software (e.g., an operating system or a device driver), or a component thereof. The source code can be transformed by an assembler or a compiler into binary code that can be executed by a computer. The source code can be logically divided into multiple source files.

US2018239898 A1 discloses detecting anomalous modifications to a software component. US2019227902 A1 discloses a classification machine learning model trained to predict the likelihood that a software program is likely to have a software bug in the future. US2021056209 A1 discloses a processor configured to determine whether a new version of a component presents an unusual risk profile, based on historical behavioral analysis.

DESCRIPTION OF DRAWINGS

FIG. 1 is a schematic diagram showing an example communication system that detects anomalies in code commits, according to an implementation.
FIG. 2 is a flowchart showing an example method for detecting anomalies in code commits, according to an implementation.
FIG. 3 is a high-level architecture block diagram of a computer according to an implementation.

Like reference numbers and designations in the various drawings indicate like elements.

SUMMARY

Accordingly there is provided a method, a computer readable medium, a computer program, and a system as detailed in the claims that follow.

DETAILED DESCRIPTION

In some implementations, a malicious insider with legitimate access to a supplier's code repositories can deliberately insert vulnerabilities or backdoors into the source code through code commits. This means that the software release has been deliberately compromised and could then be used in a software supply chain attack.

In some cases, a server can analyze behaviors of code committers and/or expected behaviors of code commits to detect anomalies in the code commits. The analysis of behaviors of code committers looks for anomalies in a code committer's behavior. For example, attribute values associated with code commits of the code committer can be analyzed to detect anomalous code commit behaviors. The analysis of the expected behaviors of code commits concerns what the committed code does in terms of software functionality. For example, the committed source code can be analyzed to identify potential backdoors.

Techniques described herein produce one or more technical effects. In some cases, the techniques can enhance security of the source code by identifying anomalies in code commits. For example, the techniques can leverage behaviors of code committers and/or expected behaviors of code commits to detect anomalies in the code commits. By detecting unusual patterns in the behaviors of code committers and/or in the expected behaviors of code commits, the techniques can improve the accuracy of anomaly identification.

In some cases, the techniques can improve the efficiency of detecting anomalies in code commits. For example, the techniques do not try to review the source code for potential risks in a blanket way, which is likely to be time consuming and can produce a high number of false positives. Instead, the techniques can identify potential risks based on anomalous behaviors of code committers and/or based on knowledge of the characteristics of backdoors to classify the potential backdoors as an abstraction. Thus, the speed of identifying potential risks in source code is increased and the number of false positives is reduced. For another example, the techniques can maintain a reliable baseline of a source code when the source code has a high degree of confidence or assurance.
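
The analysis of code committer behaviors described above can be pictured with a small sketch. It is an illustration only and assumes simple rules in place of the machine learning model recited in the claims: an unexpected reviewer, a commit hour never seen in the committer's history, and a file count more than three standard deviations above the historical mean. The Commit class, the anomalies function, the thresholds, and the sample names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean, pstdev


@dataclass
class Commit:
    author: str
    reviewer: str
    timestamp: datetime
    files: list[str]


def anomalies(commit: Commit, history: list[Commit],
              expected_reviewers: set[str]) -> list[str]:
    """Return the reasons a commit looks unusual against its author's history."""
    reasons = []
    # Attribute 1: the reviewer is not one of the expected reviewers.
    if commit.reviewer not in expected_reviewers:
        reasons.append("unexpected reviewer")
    # Attribute 2: the commit hour never appears in the history (a crude rule).
    usual_hours = {c.timestamp.hour for c in history}
    if history and commit.timestamp.hour not in usual_hours:
        reasons.append("unusual commit hour")
    # Attribute 3: far more files touched than usual (assumed 3-sigma rule).
    counts = [len(c.files) for c in history]
    if len(counts) >= 2:
        limit = mean(counts) + 3 * pstdev(counts)
        if len(commit.files) > limit:
            reasons.append("unusually many files")
    return reasons


# Hypothetical history: five daytime commits, two files each, reviewed by bob.
history = [
    Commit("alice", "bob", datetime(2023, 3, day, 10 + day % 3), ["a.py", "b.py"])
    for day in range(1, 6)
]
normal = Commit("alice", "bob", datetime(2023, 3, 6, 11), ["a.py", "b.py"])
suspect = Commit("alice", "mallory", datetime(2023, 3, 7, 3),
                 [f"f{i}.py" for i in range(9)])

print(anomalies(normal, history, {"bob"}))
print(anomalies(suspect, history, {"bob"}))
```

Returning the list of reasons, rather than a single yes or no, lets each reason be carried into an anomaly report. A trained model, as the claims describe, would replace these fixed rules.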
When a later version of the source code is to be checked for potential risks, only the relative differences between the later version of the source code and the reliable baseline need to be checked. By checking the relative differences instead of the entire newer version of source code, the techniques can save time in risk identification.

Further, the techniques described for anomaly detection lend themselves to being applied at the actual time of the code commit or retrospectively on a single commit or a series of commits. When the techniques are applied retrospectively on a series of commits, this may be done from a reliable baseline as referred to above.

FIG. 1 is a schematic diagram showing an example communication system 100 that provides data communications for detecting anomalies in code commits, according to an implementation. At a high level, the example communication system 100 includes a software developer device 102 that is communicatively coupled with a software service platform 106 and a client device 108 over a network 110. In some cases, the software developer device 102 can be part of a software devel